Abstract

The implication problem for database constraints is central in the fields of automated schema
design and query optimization and has been traditionally approached with resolution-based
techniques. We present a novel approach to database constraints, using equations instead of Horn
clauses. This formulation enables us to use new techniques for database theory, which derive from
universal algebra, equational logic and lattice theory. It also points to the possibility of employing
theorem-proving techniques originally developed for equational theories to deal with implication in
the context of logical databases.

We apply our approach to study functional and inclusion dependencies. These constraints can model 
functional determination and data duplication and they have been extensively proposed as a 
powerful and realistic feature for semantic data models. We prove completeness of new proof
procedures and we derive new upper and lower bounds for the complexity of various implication 
problems involving these dependencies. 

We also present a new class of constraints which are defined equationally, using algebraic operations
on set-theoretic partitions. These partition dependencies provide an elegant generalization of
functional dependencies (in the direction of incorporating transitive closure), for which the 
implication problem remains efficiently solvable. 

Thesis Co-Supervisor: Paris C. Kanellakis, Visiting Assistant Professor of Computer Science
(on leave from Brown University). 

Thesis Co-Supervisor: Albert R. Meyer, Professor of Computer Science. 
Chapter One
Introduction 



1.1 Functional and Inclusion Dependencies in the Relational Model 

The development of the relational data model [21, 22] led to major progress in the area of database
management. The model and its implementations have contributed significantly both to the increase 
of programmer productivity [23] and to the fundamental understanding of computation [62]. 

Among the advantages of the model, which account for its success, are [23]: 

1. The sharp, clear boundary it provides between the conceptual and the physical aspects of database
management. 

2. Its simplicity, which allows users and programmers to have a common understanding of the data 
and therefore communicate easily about it. 

3. The introduction of truly high level language concepts, which enables users to express operations 
on large pieces of information, without detailed knowledge of its representation or of the access paths 
to where it is stored. 

4. A sound, mathematical foundation, which makes possible the theoretical study of the (often 
formidable) problems of database design and manipulation. 

The relational data model consists of a structural part (with a unique data type, the relation), a 
manipulative part (with powerful algebraic operators such as selection, projection and join) and an 
integrity part (constraints defining consistent database states, intended to capture the semantics of
particular applications) [62, 51]. A relation is a table with columns named by attributes and with rows 
containing values from some domain, each row being a tuple. A database is a finite set of relations. A 
logical database or database schema consists of a database scheme, i.e. a finite set D of relation
schemes (sequences of attributes naming the columns of relations), along with a finite set $\Sigma$ of
integrity constraints (dependencies), which should be satisfied by all legal physical databases (database 
instances). 

For an example (invariant throughout the database literature), consider a database of two relations
R, S, where R has attributes EMPLOYEE and MANAGER and S has attributes MANAGER and
DEPARTMENT. If we take as our semantic restrictions that "every employee has exactly one manager"
and "every manager manages exactly one department", we define the following database schema:

$D = \{R[\text{EMPLOYEE}, \text{MANAGER}],\ S[\text{MANAGER}, \text{DEPARTMENT}]\}$
$\Sigma = \{R: \text{EMPLOYEE} \to \text{MANAGER},\ S: \text{MANAGER} \to \text{DEPARTMENT}\}$

In this case, our constraints are examples of functional dependencies [21, 22, 62, 51]. Formally, a
functional dependency (FD) is an assertion of the form $R: X \to Y$, where R is the name of a relation
and X, Y are sets of attributes from the relation scheme of R. It is satisfied by a database instance iff
whenever two tuples of relation R agree on all attributes appearing in X, they also agree on all 
attributes appearing in Y. Observe that, with no loss of generality, we can take Y to consist of a single 
attribute. 
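The satisfaction condition can be checked directly on a finite relation. The following is a minimal sketch, assuming (for illustration only) that tuples are represented as Python dictionaries from attribute names to values; the function name `satisfies_fd` and the sample data are hypothetical and do not come from the thesis.

```python
def satisfies_fd(relation, lhs, rhs):
    """Return True iff any two tuples agreeing on all attributes in lhs also agree on rhs.

    relation: list of dicts (attribute name -> value); lhs, rhs: lists of attribute names.
    """
    seen = {}  # maps the X-values of a tuple to the Y-values first seen with them
    for t in relation:
        key = tuple(t[a] for a in lhs)
        val = tuple(t[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False  # two tuples agree on X but differ on Y
    return True


# Hypothetical instance of the relation R[EMPLOYEE, MANAGER].
R = [
    {"EMPLOYEE": "ann", "MANAGER": "carl"},
    {"EMPLOYEE": "bob", "MANAGER": "carl"},
]
fd_holds = satisfies_fd(R, ["EMPLOYEE"], ["MANAGER"])      # each employee has one manager
reverse_holds = satisfies_fd(R, ["MANAGER"], ["EMPLOYEE"])  # carl manages two employees
```

In this sample instance the FD from EMPLOYEE to MANAGER holds, while the reverse FD fails because two tuples share a manager but differ on the employee.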

Functional dependencies form a conceptually simple and naturally occurring class of constraints. For
this reason, they have been extensively studied in the literature (see [7, 62, 51] for reviews of the area).
Combined with the algebraic operators of the relational model they provide a practical and elegant
approach to the problems of database design and manipulation. 

At present, a major research effort is underway towards extending the relational model. This effort
is motivated in large part by the success of the relational methodology and by the demands of specific
application domains, in particular Office Automation (see, e.g., [20, 24, 37, 42, 59, 61], which is by no
means an exhaustive list). The approach generally taken is to appropriately enrich the integrity part
by adding constraints which will enhance the expressive power of the model, while at the same time
preserving its original advantages. 

Returning to our example, suppose we also want to be able to express simple facts such as 
"everyone who manages employees belongs to some department". In other words, we want to add to 
the semantics of our relations that a MANAGER entry in relation R must also appear as a MANAGER
entry in relation S. This constraint is formally captured by the inclusion dependency [16]
$R: \text{MANAGER} \subseteq S: \text{MANAGER}$. In general, an inclusion dependency (IND) is a statement of the form
$R: A_1 \ldots A_m \subseteq S: B_1 \ldots B_m$. Such a statement is satisfied by a database instance iff whenever a tuple with
entries $a_1, \ldots, a_m$ for attributes $A_1, \ldots, A_m$ appears in relation R, a tuple with entries $a_1, \ldots, a_m$ for
attributes $B_1, \ldots, B_m$ appears in relation S.
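Satisfaction of an IND amounts to a containment between two projections. A minimal sketch follows, again assuming tuples are dictionaries from attribute names to values; `satisfies_ind` and the sample relations are hypothetical names chosen for illustration.

```python
def satisfies_ind(r, r_attrs, s, s_attrs):
    """Return True iff every r-tuple's entries on r_attrs occur in s on s_attrs.

    r, s: lists of dicts; r_attrs, s_attrs: attribute lists of equal length.
    """
    s_proj = {tuple(t[b] for b in s_attrs) for t in s}
    return all(tuple(t[a] for a in r_attrs) in s_proj for t in r)


# Hypothetical instances of R[EMPLOYEE, MANAGER] and S[MANAGER, DEPARTMENT].
R = [
    {"EMPLOYEE": "ann", "MANAGER": "carl"},
    {"EMPLOYEE": "carl", "MANAGER": "dee"},
]
S = [{"MANAGER": "carl", "DEPARTMENT": "sales"}]

violated = satisfies_ind(R, ["MANAGER"], S, ["MANAGER"])  # dee has no department yet
S_fixed = S + [{"MANAGER": "dee", "DEPARTMENT": "hr"}]
repaired = satisfies_ind(R, ["MANAGER"], S_fixed, ["MANAGER"])
```

The first check fails because the manager "dee" appears in R but not in S; adding a tuple for "dee" to S restores the inclusion.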

Inclusion dependencies make it possible to selectively define what data must be duplicated in
what relations and thus they provide a valuable tool for database design [24, 59, 69]. The central
notion of referential integrity [24, 29] can be expressed using IND's. Together with FD's, IND's form
the basis of the structural model of [67]. Descriptions of logical databases written in a variety of 
languages can be translated into a common language which uses relations, FD's and IND's [45]. 
Inclusion dependencies have also been employed to map an entity-relationship schema to the 
relational model [20]. We mention in passing that IND's have been commonly known in Artificial 
Intelligence applications as ISA relationships (cf. [9]). 

Although the addition of IND's to the relational model has been recognized as realistic and 
desirable (because of their conceptual simplicity and expressive power), they have only
recently become the object of theoretical investigation [16, 43, 54, 19, 58, 17, 44, 48, 26]. General questions
relating to the implication problem for IND's and FD's have been studied in [16, 54, 19]. A rather 
surprising result [54, 19] is that the combination of IND's with FD's is as powerful computationally as 
first-order predicate calculus. This result can be considered both positive (as it hints at the possibly
rich potential of two simple primitive forms) and negative, as it implies inherent computational 
intractability of the general case. From a more practical standpoint, [43, 17, 44, 26] provide solutions 
to database design and query optimization problems in the presence of (suitably restricted) IND's 
and FD's. Also, central notions such as the Universal Instance Assumption [62, 51] have been 
investigated using IND's [58, 48]. We will review the theoretical work on IND's in more detail in the 
sequel. 



1.2 The Implication Problem

The (unrestricted) implication problem for a class of dependencies is the following: Given a finite
set $\Sigma$ of dependencies and a dependency $\sigma$, test if $\sigma$ holds in all (not necessarily finite) databases
which satisfy the dependencies in $\Sigma$. By restricting attention to finite databases, we obtain the finite
implication problem. 

Solving the implication problem is the main computational task associated with a class of
dependencies. As a rule, algorithmic approaches to database schema design and query optimization
are based on efficient solutions of the implication problem (see, e.g., [12, 6, 3, 18, 62, 51]). Evidently,
if we are concerned with applications then the finite implication problem is the one which is most
relevant. However, it tends to be much more difficult to deal with. Moreover, for the classes of
dependencies for which implication is decidable, it generally happens that finite implication
coincides with unrestricted implication.

The problem of dependency implication can be approached in a very general setting by 
formulating dependencies as sentences in first-order logic, namely as Horn clauses [34] (see Section
5.1 of this thesis for some examples). Closely related to this approach is a particular proof procedure,
the chase; see [52, 11, 62, 51] for its wide applicability (proof procedures for general dependencies
also appear in [10, 68, 57]). It has been observed that the chase is a special case of a classical theorem 
proving technique, namely resolution [10, 11]. The chase provides straightforward algorithms for 
implication of classes of dependencies for which it can be shown to terminate. Furthermore, in these 
cases the chase produces a finite counterexample whenever implication does not hold; it is for this
reason that finite implication coincides with unrestricted implication in these cases. 

Returning now to functional and inclusion dependencies, what appears to be the fundamental
difficulty is precisely that IND's can prevent the chase from terminating. Of course, in the case of
general FD's and IND's one cannot hope to circumvent this obstacle, since the implication problem
is undecidable [54, 19]. Nevertheless, given the practical importance of these dependencies it makes
sense to study the complexity of special cases. The obvious approach that has been suggested is to
analyze the chase, but this turns out to be a very delicate task (cf. [43]), which can only give partial
results [43, 26]. Thus, it seems that new tools are required in order to make major progress.

The main contribution of this thesis is the introduction of such tools, borrowed from equational 
logic. This is a fragment of first-order logic which has attracted a lot of attention, because of its 
relevance to areas such as applicative languages, interpreters and data types (see [41] for a survey). 
However, it does not seem to have been noticed by the database theory community, since a constant
effort has been made to minimize the role of equality in dependencies (multivalued dependencies
(MVD's) [62, 51], the most widely studied after FD's, do not involve equality). The only case where
ideas from equational logic were applied in database theory seems to be the best algorithm for
losslessness of joins (a basic computational problem), which was derived from an efficient algorithm
for congruence closure [31]. Also, the best algorithm for implication of FD's [6] can be seen directly
(as we observe) as a special case of an algorithm of [47] for the generator problem in finitely presented 
algebras. 

We use the methods of equational logic to formulate and study implication problems involving
FD's and IND's. We also use equations to define a new class of dependencies (generalizing FD's) and
to investigate its implication problem. In the subsequent Sections, we review in more detail the
content of each Chapter.

1.3 Chapter Two: The Equational Approach to Dependencies

Let r be a relation over a set of attributes $\mathcal{U}$, with values taken from a domain $\mathcal{D}$. Suppose r
satisfies the FD $AB \to C$, i.e. whenever two tuples of r agree on A, B they also agree on C (here and in
the sequel we consider single relations, so we can suppress relation names from dependencies). Let x
be a variable ranging over the tuples of r and let a(x) (b(x), c(x)) be a function which assigns to a tuple
x the entry of x at attribute A (B, C). Now since r satisfies $AB \to C$, it is easy to see that there is a
function f (from $\mathcal{D}^2$ to $\mathcal{D}$) such that the following sentence is true in r:

$\forall x.\ f(a(x), b(x)) = c(x)$

This observation suggests the following syntactic transformation: the FD $AB \to C$ is rewritten as
an equation

$f\,ax\,bx = cx,$

where now the symbol a (b, c) is a function symbol of arity 1 representing the attribute A (B, C) and f
is a function symbol of arity 2 corresponding to the FD. Using the standard convention of
equational logic, we omit the universal quantifier on the variable x.

We now illustrate how this equational formalism can be used to infer FD's. 

Example 1.1: Given the FD's

$A \to B_1, \quad A \to B_2, \quad B_1 B_2 \to C$

we can infer the FD $A \to C$. Using our transformation, the given set of FD's produces the equations

$f_1 ax = b_1 x, \quad f_2 ax = b_2 x, \quad g\,b_1 x\,b_2 x = cx.$

From these we can infer the equation

$g\,f_1 ax\,f_2 ax = cx.$

In general, we can infer an FD such as $A \to C$ if we can infer an equation $\tau[x/ax] = cx$, where $\tau$ is a
term over the f's and a variable x (in Example 1.1, $\tau$ is the term $g\,f_1 x\,f_2 x$). The notation $\tau[x/ax]$ means
that we substitute ax for x in $\tau$.
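The way a witnessing term is assembled can be illustrated by a simple forward-chaining sketch. This is not the proof procedure of the thesis; it only mirrors Example 1.1 under the assumptions that each FD carries a name (its function symbol), that the source is a single attribute, and that terms are written with parentheses for readability. The function `derive_term` and its input format are hypothetical.

```python
def derive_term(fds, source, target):
    """Build a term over the FD symbols and the variable x witnessing source -> target.

    fds: dict mapping a function symbol to (tuple of left-hand attributes, right-hand attribute).
    Returns the term as a string, or None if target is not reachable from source.
    """
    terms = {source: "x"}  # the variable x stands for the source attribute
    changed = True
    while changed:
        changed = False
        for name, (lhs, rhs) in fds.items():
            if rhs not in terms and all(a in terms for a in lhs):
                # Compose the FD symbol with the terms already built for its left-hand side.
                terms[rhs] = name + "(" + ", ".join(terms[a] for a in lhs) + ")"
                changed = True
    return terms.get(target)


# The FD's of Example 1.1, named f1, f2 and g as in the text.
FDS = {
    "f1": (("A",), "B1"),
    "f2": (("A",), "B2"),
    "g": (("B1", "B2"), "C"),
}
tau = derive_term(FDS, "A", "C")
```

Here `tau` is the string `g(f1(x), f2(x))`, the parenthesized form of the term $g\,f_1 x\,f_2 x$ from Example 1.1; substituting ax for x gives the left-hand side of the inferred equation.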



Interestingly, this equational formulation can be extended to IND's as well. Suppose relation r
satisfies the IND $A_1 A_2 \subseteq B_1 B_2$, i.e. for each tuple t of r there is a tuple t' of r such that the values of t'
on $B_1, B_2$ are the same as the values of t on $A_1, A_2$ respectively. This means the following sentence is
true in r:

$\forall x\, \exists y.\ [b_1(y) = a_1(x) \wedge b_2(y) = a_2(x)]$

(as before, x, y are variables ranging over the tuples of r and $a_1, a_2, b_1, b_2$ are functions corresponding to
the attributes $A_1, A_2, B_1, B_2$).
Consider now the Skolemization of the existential quantifier $\exists y$: one obtains the sentence

$\forall x.\ [b_1(i(x)) = a_1(x) \wedge b_2(i(x)) = a_2(x)]$

which is true in r for some suitable function i(x) (from tuples to tuples). This suggests transforming
the IND $A_1 A_2 \subseteq B_1 B_2$ into the set of equations

$b_1 ix = a_1 x, \quad b_2 ix = a_2 x$

(here i is a function symbol of arity 1 corresponding to the IND).

Example 1.2: From the dependencies

$A_1 A_2 \subseteq B_1 B_2, \quad A_2 A_3 \subseteq B_2 B_3, \quad B_2 \to B_3$

we can infer the IND $A_1 A_2 A_3 \subseteq B_1 B_2 B_3$ [16, 54]. Using our transformation, the given set of
dependencies produces the equations

$b_1 ix = a_1 x, \quad b_2 ix = a_2 x,$
$b_2 jx = a_2 x, \quad b_3 jx = a_3 x,$
$f b_2 x = b_3 x.$

From these we can infer

$b_3 ix = f b_2 ix = f a_2 x = f b_2 jx = b_3 jx = a_3 x,$

i.e. we can infer the set of equations

$b_1 ix = a_1 x, \quad b_2 ix = a_2 x, \quad b_3 ix = a_3 x.$

In general, we can infer an IND such as $A_1 A_2 A_3 \subseteq B_1 B_2 B_3$ if we can infer a set of equations
$b_1 \tau = a_1 x$, $b_2 \tau = a_2 x$, $b_3 \tau = a_3 x$, where $\tau$ is some term over the i's and a variable x (in Example 1.2, $\tau$ is
simply ix).

Thus, we can use equational reasoning to obtain a proof procedure for FD's and IND's. The
soundness and completeness of this approach is demonstrated in Theorem 2.1. As a matter of fact, the
soundness part (whenever an equation of the appropriate form is implied, the corresponding
dependency is implied) is easy and it should already be plausible from the preceding discussion. The
difficult part is completeness (whenever a dependency is implied, an equation of the appropriate
form is implied). This is proved by a rather delicate induction, which shows that equational reasoning
can simulate the chase.

We can also have a slightly different syntactic transformation of dependencies into equations. This 
transformation, however, does not have a straightforward semantic justification.

Consider the FD's in Example 1.1: We can transform them into the equations

$f_1 a = b_1, \quad f_2 a = b_2, \quad g\,b_1 b_2 = c,$

from which we can infer the equation

$g\,f_1 a\,f_2 a = c.$

The symbols $a, b_1, b_2, c$ are now constant symbols representing the attributes $A, B_1, B_2, C$.

When approached this way, the implication problem for FD's becomes a special case of the
generator problem for finitely presented algebras [47], for which [47] gives a polynomial-time
algorithm. By inspecting the behaviour of [47]'s algorithm in this special case, we obtain the linear-time
algorithm for implication of FD's given in [6].
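In this constant-symbol reading, deciding whether an FD is implied amounts to asking whether the right-hand attribute is generated from the left-hand attributes by the FD symbols. The sketch below uses the textbook fixpoint computation of an attribute closure; it is a simple quadratic version given only for illustration, and is neither the linear-time algorithm of [6] nor the algorithm of [47]. The names `closure` and `implies_fd` are hypothetical.

```python
def closure(attrs, fds):
    """Set of attributes generated from attrs by repeatedly applying the FD's.

    fds: list of (tuple of left-hand attributes, right-hand attribute).
    """
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if rhs not in closed and set(lhs) <= closed:
                closed.add(rhs)  # all generators of this FD are present, so rhs is generated
                changed = True
    return closed


def implies_fd(fds, lhs, rhs):
    """True iff the FD lhs -> rhs follows from fds (FD's only, single relation)."""
    return rhs in closure(lhs, fds)


# The FD's of Example 1.1.
FDS = [(("A",), "B1"), (("A",), "B2"), (("B1", "B2"), "C")]
a_gives_c = implies_fd(FDS, ["A"], "C")
b1_gives_c = implies_fd(FDS, ["B1"], "C")
```

With the FD's of Example 1.1, C is generated from A, while B1 alone does not generate C.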

This alternative transformation can also be extended to IND's. We transform the IND
$A_1 A_2 \subseteq B_1 B_2$ into the set of equations

$i b_1 = a_1, \quad i b_2 = a_2.$

Observe that we have now eliminated the variable x, which can play an essential role when IND's are
combined with FD's (cf. Example 1.2). For this reason we also need equations of the form

$f\,ix = i\,fx,$

which permit us to move the f's over the i's and vice versa. The soundness and completeness of this
approach is also proved in Theorem 2.1.

The equational formulation of dependencies is more redundant than the standard one, since we
need to introduce new symbols (f's and i's). On the other hand, inferences of dependencies now give
us more information: whenever we infer a dependency $\sigma$ from a set of dependencies $\Sigma$, the
associated term $\tau$ (cf. Examples 1.1, 1.2) tells us how $\sigma$ results (in any database satisfying $\Sigma$) by
"composing" dependencies in $\Sigma$.

In the remainder of Chapter 2, we use our equational approach to prove several results relating to
FD and IND implication. We first give a new proof procedure for FD's and IND's (Theorem 2.2).
This proof procedure is different in spirit both from the chase and the proof procedure of [54] and it
treats FD's and IND's in a symmetric fashion. The equational tools come into play in the proof of
completeness of this proof procedure. Usually, completeness is proved by constructing a database
which satisfies a set of dependencies $\Sigma$ but violates a dependency $\sigma$ (assuming $\sigma$ cannot be proved
from $\Sigma$); see, e.g., [11, 54, 62]. In our case, we consider the set of equations $\mathcal{E}_\Sigma$ obtained from $\Sigma$ and
we construct an algebra which satisfies $\mathcal{E}_\Sigma$ but violates any equation that could correspond to $\sigma$.

Our second result is a precise characterization of the complexity of acyclic IND's and FD's.
Intuitively, a set of IND's is acyclic [58] if it does not contain any cycles of inclusions, such as
$\{R: A_1 \subseteq R: A_2\}$, $\{R: A \subseteq S: B,\ S: B' \subseteq R: A'\}$ and so on. Acyclic sets of IND's have been proposed
as a useful tool for database schema design [58]. One can easily observe that the implication problem
for acyclic IND's and FD's can be solved in exponential time (the chase terminates in this case).
NP-hardness lower bounds for the problem were obtained in [26].

We show that the implication problem for acyclic IND's and FD's requires exponential time
(Theorem 2.4). The main observation is that, when all FD's are unary (i.e. the left-hand side contains
a single attribute), the equational inferences of Examples 1.1, 1.2 can be viewed as inferences in
semigroups (Corollary 2.3). Such inferences can in turn simulate computations of an automaton with
two pushdown stores. Since such automata are universal computing devices, we obtain a tight
undecidability result for FD and IND implication (Theorem 2.3). Furthermore, the acyclicity
condition on the IND's corresponds to bounding the size of one of the pushdown stores, which gives
us exponential time.

1.4 Chapter Three: Application to Typed IND's

A usual assumption in database theory is that all database relations are projections of a single 
universal relation (Universal Instance Assumption [62, 51]). In practice this is not always the case, so 
one has the problem of testing the existence of a universal instance and the problem of adjusting the
database relations to maintain the existence of a universal instance as the database is updated. Both of
these problems are known to be NP-complete [39]. An alternative, weaker condition we may impose
on a multi-relational database is pairwise consistency, i.e. every pair of the database relations is
required to have a universal relation. This condition is easy to test and maintain, as described in
numerous works on the subject (see [8] for a review). In fact, if the database scheme is acyclic [8] then
pairwise consistency implies the existence of a universal instance.

Most of the theoretical work on dependencies is done in the context of databases consisting of a 
single relation, i.e. it assumes the existence of a universal instance [62, 51]. A natural question, then, 
is to investigate the effect of the weaker assumption of pairwise consistency on the implication
problem, say for functional dependencies. Although the implication problem for FD's is solvable in
linear time assuming a universal instance [6], it is not clear even if it is decidable in the context of
pairwise consistency.

Let $r_1, r_2$ be relations over relation schemes $R_1[U_1]$, $R_2[U_2]$ respectively. It is not difficult to see
that $r_1, r_2$ have a universal instance iff the projection of $r_1$ on $U_1 \cap U_2$ is the same as the projection of
$r_2$ on $U_1 \cap U_2$ [1]. This can be expressed (with a slight abuse of notation) by the pair of IND's

$R_1: U_1 \cap U_2 \subseteq R_2: U_1 \cap U_2, \quad R_2: U_1 \cap U_2 \subseteq R_1: U_1 \cap U_2.$

These are examples of typed IND's. An IND is typed [17, 48] if it has the form $R: A_1 \ldots A_m \subseteq S: A_1 \ldots A_m$.
By the above observation, we can then formulate the implication problem for FD's in the presence of
pairwise consistency as an implication problem for FD's and (typed) IND's.
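The projection test for a pair of relations is easy to state in code. The sketch below assumes finite relations represented as lists of dictionaries, together with explicit attribute lists for the two relation schemes; `project` and `consistent` are hypothetical helper names.

```python
def project(relation, attrs):
    """Projection of a relation (list of dicts) on the attribute list attrs, as a set of tuples."""
    return {tuple(t[a] for a in attrs) for t in relation}


def consistent(r1, u1, r2, u2):
    """True iff r1 and r2 have the same projection on the common attributes of u1 and u2."""
    common = sorted(set(u1) & set(u2))  # fixed order so both projections line up
    return project(r1, common) == project(r2, common)


# Hypothetical instances over R1[EMPLOYEE, MANAGER] and R2[MANAGER, DEPARTMENT].
r1 = [{"EMPLOYEE": "ann", "MANAGER": "carl"}]
r2 = [{"MANAGER": "carl", "DEPARTMENT": "sales"}]
u1, u2 = ["EMPLOYEE", "MANAGER"], ["MANAGER", "DEPARTMENT"]

pair_ok = consistent(r1, u1, r2, u2)
r2_extra = r2 + [{"MANAGER": "dee", "DEPARTMENT": "hr"}]
pair_broken = consistent(r1, u1, r2_extra, u2)  # dee appears in r2 only
```

Equality of the two projections corresponds to the two typed IND's holding in both directions; in the second check only one direction holds, so the pair is not consistent.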

In this Chapter, we apply the equational techniques of Chapter 2 to study the implication problem 
for FD's and typed IND's. The main tool we develop is a proof procedure for general FD's and IND's 
(Theorem 3.1). This proof procedure is different from the procedure of Theorem 2.2 and somewhat 
reminiscent in spirit of the axiomatization of [54]. We prove completeness of the procedure by 
showing that it captures (indirectly) equational inferences as in Examples 1.1, 1.2. 

By analyzing the behaviour of this proof procedure in the case of typed IND's, we obtain a 
decidability result for typed IND's and FD's satisfying an acyclicity condition (Corollary 3.1). We 
then further specialize the proof procedure to the case of unary FD's in the presence of pairwise 
consistency (Lemma 3.2). By a rather complicated analysis of derivations, we show that this 
implication problem is undecidable (Theorem 3.3). This provides a very tight undecidable case of FD
and IND implication. 

Finally, we use Lemma 3.2 to show that there is no k-ary axiomatization (involving only FD's and 
IND's) for implication of unary FD's under pairwise consistency (Theorem 3.4; the technical notion 
of a k-ary axiomatization is explained in Chapter 3). This strengthens a previous result of [16] about
non-existence of k-ary axiomatizations for FD's and IND's.

1.5 Chapter Four: Finite Implication of FD's and Unary IND's 

Given the importance of the finite implication problem, it is natural to ask if our equational
approach can be extended to finite implication. Unfortunately, there are difficulties. The
completeness part of Theorem 2.1 is proved by analyzing a proof procedure (the chase). However, in 
the case of finite implication of FD's and IND's such a proof procedure does not even exist [54, 19]. 

Nevertheless, we can have a complete proof procedure for finite implication of FD's and IND's, if 
we restrict ourselves to IND's with one attribute per side (unary IND's). Unrestricted implication 
becomes rather uninteresting in this case, because FD's and unary IND's do not interact in any non- 
trivial way (Proposition 4.1). However, in the finite case we have the following interaction: 

from $A \to A_1$ and $A_1 \supseteq A_2$ and ... and $A_{m-1} \to A_m$ and $A_m \supseteq A$
derive $A_1 \to A$ and $A_2 \supseteq A_1$ and ... and $A_m \to A_{m-1}$ and $A \supseteq A_m$
(m odd).

It turns out that this is the only non-trivial interaction: by turning the above observation into a set of
inference rules (one for each odd m) and including the usual inference rules for FD's [5] and IND's
[16], we obtain a complete axiomatization for FD's and unary IND's in the finite case (Theorem 4.1).
The completeness proof is rather long and it involves an intricate construction of a finite
counterexample relation. We also remark that this axiomatization leads to a polynomial-time
algorithm for finite implication of FD's and unary IND's [44]. The class of FD's and unary IND's is
the only known class of dependencies for which unrestricted and finite implication are both solvable
without being identical.
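The soundness of the cycle rule for finite relations can be seen by a counting argument, sketched here. The notation $r[A]$ for the set of values appearing in column A of a finite relation r is an assumption made for this sketch; the argument itself is the standard cardinality reasoning and is not quoted from the thesis.

```latex
% FD  A -> B     : every value in r[A] determines one value in r[B], and every
%                  value in r[B] arises this way, so |r[B]| <= |r[A]|.
% IND A \supseteq B : r[B] is a subset of r[A],          so |r[B]| <= |r[A]|.
% Applying this along the premises of the rule gives a closed chain:
\[
|r[A]| \;\ge\; |r[A_1]| \;\ge\; |r[A_2]| \;\ge\; \cdots \;\ge\; |r[A_m]| \;\ge\; |r[A]| .
\]
% Hence all these cardinalities are equal. For finite sets of equal size,
% an onto map is one-to-one (so each FD can be reversed) and
% a subset is the whole set (so each IND can be reversed).
```

For infinite relations the last step fails, since an infinite set can be mapped onto, or properly contain, a set of the same cardinality; this is consistent with unrestricted and finite implication differing for this class.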

Interestingly, the above axiomatization can also be used to prove an analogue of Theorem 2.1 for
finite implication of FD's and unary IND's (Theorem 4.2). However, this result is weaker, in the
following way. Suppose, for example, that we want to test if the FD $A \to B$ is implied from a set of
dependencies $\Sigma$. In the unrestricted case we can show that, if $A \to B$ is implied, then there is a term $\tau$
such that the equation $\tau[x/ax] = bx$ is implied (cf. Example 1.1); i.e., $\tau[x/ax] = bx$ holds in all algebras
which satisfy the equations corresponding to $\Sigma$. In the finite case, we can only show that, for each
algebra $\mathcal{A}$ as above, there is a term $\tau$ (depending on $\mathcal{A}$) such that the equation $\tau[x/ax] = bx$ holds in
$\mathcal{A}$.

1.6 Chapter Five: Partition Dependencies

We have presented in Chapter 2 an equational formulation of functional dependencies. One can
also have another formulation of quite different flavor, using algebraic operations on partitions (this
seems to be a folklore observation, see e.g. [15, 60]).

Specifically, let r be a relation and for each attribute A let $\pi_A$ be the following partition of the set
of tuples of r: tuples t, s are in the same block of $\pi_A$ iff they agree on attribute A. Now it is easy to see
that r satisfies the FD $A \to B$ iff

$\pi_A \le \pi_B$

or, equivalently,

$\pi_A = \pi_A \cdot \pi_B, \quad \pi_B = \pi_A + \pi_B.$

Here $\le$ is the usual refines relation and $\cdot$, $+$ are the usual product and sum operations on partitions.

We are thus led to consider general equations over $\cdot$, $+$ and the $\pi_A$'s. We call such equations
partition dependencies (PD's) [27].
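The partition reading of an FD can be tried out on a small relation. The following sketch assumes tuples are identified by their position in a list and that partitions are sets of frozensets of positions; the helper names `partition`, `product`, `psum` and `refines` are hypothetical, with the product taken as the coarsest common refinement and the sum as the finest common coarsening.

```python
def partition(relation, attr):
    """Partition of tuple positions: i and j share a block iff tuples i, j agree on attr."""
    blocks = {}
    for i, t in enumerate(relation):
        blocks.setdefault(t[attr], set()).add(i)
    return {frozenset(b) for b in blocks.values()}


def product(p, q):
    """Product of partitions: the nonempty pairwise intersections of blocks."""
    return {a & b for a in p for b in q if a & b}


def psum(p, q):
    """Sum of partitions: merge blocks of p and q that overlap, transitively."""
    result = []  # list of pairwise disjoint sets
    for block in list(p) + list(q):
        merged = set(block)
        rest = []
        for other in result:
            if other & merged:
                merged |= other  # absorb every existing block that overlaps
            else:
                rest.append(other)
        result = rest + [merged]
    return {frozenset(b) for b in result}


def refines(p, q):
    """True iff every block of p is contained in some block of q."""
    return all(any(a <= b for b in q) for a in p)


# Hypothetical relation satisfying the FD A -> B but not B -> A.
r = [{"A": 1, "B": "x"}, {"A": 2, "B": "x"}, {"A": 3, "B": "y"}]
pa, pb = partition(r, "A"), partition(r, "B")
```

On this relation `pa` refines `pb`, `product(pa, pb)` equals `pa` and `psum(pa, pb)` equals `pb`, matching the three equivalent conditions for the FD from A to B; none of them holds with the roles of A and B exchanged.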

We first compare the expressive power of PD's to that of previously studied database constraints, 
namely embedded implicational dependencies (EID's) [34]. A first observation is that PD's of the form
$\pi_A = \pi_B + \pi_C$ can express symmetric transitive closure (Example 5.2). It follows by a simple
compactness argument that such PD's cannot be expressed by any set of EID's (Theorem 5.1). On 
the other hand, PD's are unable to detect complicated patterns of equalities in relations and for this 
reason they cannot express, for instance, multivalued dependencies (Theorem 5.2). 

We then study the implication problem for PD's. We observe that the (finite) implication problem
for PD's is equivalent to the uniform word problem for (finite) lattices (Lemma 5.1). This follows 
from two deep results of lattice theory, namely that (finite) equivalence relations can represent 
arbitrary (finite) lattices [66, 56]. Using techniques from universal algebra [36, 47] and lattice theory 
[28], we show that these word problems are equivalent and they can be solved in polynomial time
(Theorem 5.3). 
Finally, we examine the problem of testing consistency [38, 64] of a database with a set of PD's.
Using our polynomial-time algorithm for implication, we show that it can be reduced to testing
consistency with a set of FD's [38]. It follows that the problem can be solved in polynomial time
(Theorem 5.4). 

1.7 Credits 

The research reported in this thesis was done in close collaboration with Paris C. Kanellakis, and
has been documented in a series of joint publications [25, 26, 44, 27]. Individual credit for the main
results goes as follows: 

Theorems 2.1, 2.2, 2.3, 2.4 were obtained jointly, and appeared in [25]. 

Theorems 3.1, 3.2, 3.3 are due to the author of this thesis, and appeared in [25]. Theorem 3.4 was
obtained jointly, and appeared in [26]. 

Theorem 4.1 was obtained jointly, but Paris C. Kanellakis was the main contributor; this result 
appeared in [44]. Theorem 4.2 was obtained jointly, and appeared in [25]. 

Theorem 5.3 was obtained jointly, but the author of this thesis was the main contributor; this
result appeared in [27]. Theorems 5.1, 5.2, 5.4 were obtained jointly, and appeared also in [27]. 

The extension to general dependencies outlined in the concluding chapter is due to the author of 
this thesis. 
Chapter Two
The Equational Approach to Dependencies 



We present in this Chapter the equational formalization of functional and inclusion dependencies.
Section 2.1 gives the necessary definitions and background from database theory and equational
logic. In Section 2.2 we present the main Theorem and its Corollaries. We use it in Section 2.3 to
prove completeness of a new proof procedure for FD's and IND's. In Section 2.4 we apply the
equational formulation to prove new lower bounds for FD and IND implication.



2.1 Definitions 

2.1.1 Relational Database Theory 

Let $\mathcal{U}$ be a finite set of attributes and $\mathcal{D}$ a countably infinite set of values, such that $\mathcal{U} \cap \mathcal{D} = \emptyset$. A
relation scheme is an object R[U], where R is the name of the relation scheme and $U \subseteq \mathcal{U}$. A tuple t
over U is a function from U to $\mathcal{D}$. Let $U = \{A_1, \ldots, A_n\}$ and $a_k$ a value, $k = 1, \ldots, n$; if $t[A_k] = a_k$, we
represent tuple t over U as $a_1 a_2 \ldots a_n$. We represent the restriction of tuple t on a subset X of U as t[X].
A relation r over U (named R) is a (possibly infinite) nonempty set of tuples over U. A database
scheme D is a finite set of relation schemes $\{R_1[U_1], \ldots, R_q[U_q]\}$ and a database $d = \{r_1, \ldots, r_q\}$ associates
each relation scheme $R_k[U_k]$ in D with a relation $r_k$ over $U_k$. A database is finite if all of its relations
are finite. A database can be visualized as a set of tables, one for each relation, whose headers are the
relation schemes (each column headed by an attribute) and whose rows are the tuples. 

The logical constraints which determine the set of legal databases are called database dependencies 
[62, 51]. We will be examining two very common types of dependencies. 

FD: $R: A_1 \ldots A_n \to A$ ($n > 0$) is a functional dependency [62, 51].
Relation r (named R) satisfies this FD iff,
for tuples $t_1, t_2$ in r, $t_1[A_1 \ldots A_n] = t_2[A_1 \ldots A_n]$ implies $t_1[A] = t_2[A]$.

If n = 1, i.e. the left-hand side contains a single attribute, we have a unary functional dependency
(u-FD).

IND: $S: D_1 \ldots D_m \subseteq R: C_1 \ldots C_m$ ($m > 0$) is an inclusion dependency [16].
Relations s, r (named S, R respectively) satisfy this IND iff,
for each tuple t in s, there is a tuple $t_1$ in r with $t_1[C_k] = t[D_k]$, $k = 1, \ldots, m$.

If m = 1, we have a unary inclusion dependency (u-ID). 

Equality of two columns headed by attributes A,B in a relation named R can be expressed as a 
special case of IND's: Use an IND such as $R: AB \subseteq R: AA$. These dependencies are particularly
illustrative of our analysis; we will use A = B to denote them.

Database Notation: We use a graph notation to represent an input database scheme D and a set of
dependencies $\Sigma$ (input schema). We construct a labeled directed graph $G_\Sigma$ (see Figure 2-1), which has
exactly one node $a_k^j$ for each attribute $A_k$ of each relation scheme $R_j$. For each IND
$R_2: D_1 \ldots D_m \subseteq R_1: C_1 \ldots C_m$ in $\Sigma$, the graph $G_\Sigma$ contains m black arcs $(c_1^1, d_1^2), \ldots, (c_m^1, d_m^2)$; each arc is
labeled by the name i of the IND. For each FD $R_1: A_1 \ldots A_n \to A$ in $\Sigma$, the graph $G_\Sigma$ contains a group
of n red arcs $(a_1^1, a^1), \ldots, (a_n^1, a^1)$; the group is labeled by the name f of the FD and its arcs are ordered
from 1 to n as listed above.

We also construct two directed graphs $I_\Sigma$ and $F_\Sigma$ (see Figure 2-1): The graph $I_\Sigma$ has one node for
each relation scheme name in D and arc $(R_j, R_k)$ iff $G_\Sigma$ contains some black arc $(a^j, b^k)$. The graph $F_\Sigma$
has one node a for each attribute A of D and arc (a, b) iff $G_\Sigma$ contains some red arc $(a^j, b^j)$. We now
define special syntactically restricted forms of input schemata:

Acyclic IND's: $I_\Sigma$ is acyclic [58].

Acyclic FD's: $F_\Sigma$ is acyclic.

Typed IND's: The black arcs of $G_\Sigma$ are all of the form $(a^j, a^k)$ for relation names $R_j, R_k$ and attribute
A [17, 48].

Typed IND's are between occurrences of the same attribute names in different relation schemes.
If we assume that all possible typed IND's are in the input schema (i.e., with some abuse of notation,
$R: U \cap U' \subseteq S: U \cap U'$ for all relation schemes R[U], S[U'] in database scheme D), then we have
pairwise consistency PC(D) [48].
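
The acyclicity restrictions are purely syntactic and can be tested by a depth-first search over the graphs $I_\Sigma$ and $F_\Sigma$. The sketch below is illustrative only: the encodings of IND's as pairs of relation names and of FD's as (left-hand side, right-hand side) pairs, and the names `ind_arcs`, `fd_arcs` and `is_acyclic`, are assumptions made for the example.

```python
def ind_arcs(inds):
    """Arcs of the relation-level graph: one arc per IND  S:... <= R:...,
    directed from R to S (the direction of the black arcs)."""
    return [(r_rel, s_rel) for s_rel, r_rel in inds]


def fd_arcs(fds):
    """Arcs of the attribute-level graph: (a, b) for every FD  ...a... -> b."""
    return [(a, rhs) for lhs, rhs in fds for a in lhs]


def is_acyclic(arcs):
    """Return True iff the directed graph given by the arcs has no cycle."""
    graph = {}
    for u, v in arcs:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set())
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = dict.fromkeys(graph, WHITE)

    def visit(u):
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY:  # back arc closes a cycle
                return False
            if color[v] == WHITE and not visit(v):
                return False
        color[u] = BLACK
        return True

    return all(color[u] != WHITE or visit(u) for u in graph)
```

Note that under this encoding an IND between two attribute lists of the same relation scheme yields a self-loop, so such a schema does not count as acyclic.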
Implication: We say that $\Sigma$ implies $\sigma$ ($\Sigma \models \sigma$) if, whenever a database d satisfies $\Sigma$, it also satisfies
$\sigma$. We say that $\Sigma$ finitely implies $\sigma$ ($\Sigma \models_{fin} \sigma$) if, whenever a finite database d satisfies $\Sigma$, it also
satisfies $\sigma$.

Clearly if $\Sigma \models \sigma$ (implication) then $\Sigma \models_{fin} \sigma$ (finite implication), but the converse is not always true.
Deciding implication of dependencies is a central problem in database theory.

Since dependencies are sentences in first-order predicate calculus with equality, we have proof
procedures for the implication problem (we denote provability as $\Sigma \vdash \sigma$). A proof procedure is sound
if whenever $\Sigma \vdash \sigma$, we have $\Sigma \models \sigma$; and complete, if it is sound and whenever $\Sigma \models \sigma$, we have $\Sigma \vdash \sigma$.

The standard complete proof procedure for database dependencies is the chase [62, 11]. We now 
present the chase for FD's and IND's (cf. [43]). 

Chase: Given an input schema D, $\Sigma$ and a dependency $\sigma$, construct a set of tables T, with D's
relation schemes as headers. These tables are originally empty and will be filled with symbols from
the countably infinite set $\mathcal{S}$. Whenever we insert a new row of symbols from $\mathcal{S}$ in a table of T and we
do not specify some of the entries of this row, we assume that distinct symbols from $\mathcal{S}$, which have
not yet appeared elsewhere in T, are used to fill these entries. We use $t_k^R$ for the k-th row of table R
and $t_k^R[X]$ for this row's entries in the columns of attributes X.

The initial configuration of T depends on $\sigma$ as follows:
(i) If $\sigma$ is the FD $R: A_1 \ldots A_n \to A$: insert rows $t_1^R$, $t_2^R$ with the only restriction that
$t_1^R[A_k] = t_2^R[A_k]$, $k = 1,\ldots,n$.
(ii) If $\sigma$ is the IND $S: D_1 \ldots D_m \subseteq R: C_1 \ldots C_m$: insert $t_1^S$.

Every dependency in $\Sigma$ produces a rule, as follows:
If f is an FD in $\Sigma$ the corresponding FD-rule is:

<Consider T a database over symbols in $\mathcal{S}$. If T does not satisfy f, because two symbols x and y are
different, then replace y by x in T.>
If i is an IND $R: X \subseteq S: Y$ in $\Sigma$ the corresponding IND-rule is:

<Consider T a database over symbols in $\mathcal{S}$. If T does not satisfy i, because some $t^R[X]$ does not appear
in the table S as some $t^S[Y]$, then insert $t^S$ in S with $t^S[Y] = t^R[X]$.>

We will say that $\Sigma \vdash_{chase} \sigma$, if there is a finite sequence of applications of the FD-rules and IND-rules
produced by $\Sigma$ that transforms T's initial configuration to a final configuration satisfying:
(i) If $\sigma$ is an FD as above: $t_1^R[A] = t_2^R[A]$.
(ii) If $\sigma$ is an IND as above: for some j,
$t_1^S[D_k] = t_j^R[C_k]$, $k = 1,\ldots,m$.

Proposition 2.1: $\Sigma \vdash_{chase} \sigma$ iff $\Sigma \models \sigma$. ■
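
The chase can be sketched as a bounded search. The code below is a hedged illustration rather than a decision procedure: since implication of FD's and IND's is not decidable in general, the sketch stops after a hypothetical `max_steps` rule applications and returns `None` when no conclusion was reached. All encodings (the `schema` dictionary, the tuple formats for dependencies, integers as fresh symbols) are assumptions of the example, and the sketch assumes that no IND repeats an attribute on its right-hand side.

```python
from itertools import count


def chase(schema, fds, inds, goal, max_steps=1000):
    """Bounded chase. schema: {relation: [attributes]};
    fds: [(rel, lhs_attrs, rhs_attr)]; inds: [(s_rel, s_attrs, r_rel, r_attrs)];
    goal: ("FD", rel, lhs, rhs) or ("IND", s_rel, s_attrs, r_rel, r_attrs).
    Returns True (proved), False (fixpoint without proof) or None (bound hit)."""
    fresh = count()  # supply of new symbols
    tables = {rel: [] for rel in schema}

    def new_row(rel, fixed=None):
        row = {a: next(fresh) for a in schema[rel]}  # unspecified entries are fresh
        row.update(fixed or {})
        tables[rel].append(row)
        return row

    if goal[0] == "FD":
        _, rel, lhs, rhs = goal
        t1 = new_row(rel)
        t2 = new_row(rel, {a: t1[a] for a in lhs})

        def proved():
            return t1[rhs] == t2[rhs]
    else:
        _, s_rel, s_attrs, r_rel, r_attrs = goal
        t1 = new_row(s_rel)

        def proved():
            return any(all(t[c] == t1[d] for c, d in zip(r_attrs, s_attrs))
                       for t in tables[r_rel])

    def rename(old, new):  # FD-rule: replace symbol old by new everywhere
        for rows in tables.values():
            for row in rows:
                for a, v in row.items():
                    if v == old:
                        row[a] = new

    def step():  # apply one applicable rule; report whether anything changed
        for rel, lhs, rhs in fds:
            for u in tables[rel]:
                for v in tables[rel]:
                    if u[rhs] != v[rhs] and all(u[a] == v[a] for a in lhs):
                        rename(v[rhs], u[rhs])
                        return True
        for s_rel, s_attrs, r_rel, r_attrs in inds:
            for t in tables[s_rel]:
                want = [t[d] for d in s_attrs]
                if not any([u[c] for c in r_attrs] == want for u in tables[r_rel]):
                    new_row(r_rel, dict(zip(r_attrs, want)))  # IND-rule
                    return True
        return False

    for _ in range(max_steps):
        if proved():
            return True
        if not step():
            return False
    return True if proved() else None
```

A `False` answer means the chase reached a configuration satisfying every dependency without proving the goal, so that configuration is a finite counterexample.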

An alternative proof procedure for FD's and IND's is provided by the axiomatization of [54]. If $\Sigma$
is a set of FD's and IND's and $\sigma$ is an FD or IND, then $\Sigma \models \sigma$ iff $\sigma$ can be proved from $\Sigma$ using the
following rules (X, Y denote sets of attributes):

1. (reflexivity) $R: A \to A$.

2. (augmentation) from $R: X \to A$ derive $R: XY \to A$.

3. (transitivity) from $R: X \to A_k$, $k = 1,\ldots,n$, and $R: A_1 \ldots A_n \to A$, derive $R: X \to A$.

4. (IND reflexivity) $R: A_1 \ldots A_m \subseteq R: A_1 \ldots A_m$.

5. (IND transitivity) from $R_1: A_1 \ldots A_m \subseteq R_2: B_1 \ldots B_m$ and $R_2: B_1 \ldots B_m \subseteq R_3: C_1 \ldots C_m$ derive
$R_1: A_1 \ldots A_m \subseteq R_3: C_1 \ldots C_m$.

6. (permutation, projection and redundancy) from $R: A_1 \ldots A_m \subseteq S: B_1 \ldots B_m$ derive
$R: A_{j_1} \ldots A_{j_p} \subseteq S: B_{j_1} \ldots B_{j_p}$, where $1 \le j_k \le m$, $k = 1,\ldots,p$.

7. (equivalence) from $R: AB \subseteq S: CC$ and $\sigma$ derive $\tau$, where $\tau$ is obtained from $\sigma$ by
substituting A for one or more occurrences of B.

8. (pullback) from $R: A_1 \ldots A_n A \subseteq S: B_1 \ldots B_n B$ and $S: B_1 \ldots B_n \to B$ derive $R: A_1 \ldots A_n \to A$.

9. (collection) from $R: A_1 \ldots A_n B_1 \ldots B_m \subseteq S: A'_1 \ldots A'_n B'_1 \ldots B'_m$, $R: B_1 \ldots B_m C \subseteq S: B'_1 \ldots B'_m C'$ and
$S: B'_1 \ldots B'_m \to C'$ derive $R: A_1 \ldots A_n B_1 \ldots B_m C \subseteq S: A'_1 \ldots A'_n B'_1 \ldots B'_m C'$.

10. (attribute introduction) from $R: A_1 \ldots A_n \subseteq S: B_1 \ldots B_n$ and $S: B_1 \ldots B_n \to B$ derive
$R: A_1 \ldots A_n N \subseteq S: B_1 \ldots B_n B$, where N is a new attribute.

Rules 1-3 are the standard rules for FD's [5, 62] (written in our notation) and Rules 4-6 are the
rules of [16] for IND's without repeated attributes. The salient rule is attribute introduction (Rule 10).
Whenever this rule is applied, the attribute N is chosen to be an attribute which does not appear in $\Sigma$
or in any previous step of the derivation. Rule 10 is sound in the following sense: Whenever the
antecedents are true in relations r, s (over relation schemes R, S respectively), there is a relation r'
which differs from r only on a new column headed by N and which satisfies the conclusion.

2.1.2 Equational Logic

Let M be a set of symbols and ARITY a function from M to the nonnegative integers $\mathcal{N}$. The set of
finite strings over M is M*. Partition M into two sets:

$G = \{g \in M \mid \mathrm{ARITY}(g) = 0\}$ is the set of generators,
$O = \{\theta \in M \mid \mathrm{ARITY}(\theta) > 0\}$ is the set of operators.

Definition 2.1: $\mathcal{T}(M)$, the set of terms over M, is the smallest subset of M* such that,

1) every g in G is a term,

2) if $\tau_1,\ldots,\tau_m$ are terms and $\theta$ is in O with $\mathrm{ARITY}(\theta) = m$, then $\theta\tau_1 \ldots \tau_m$ is a term.

A subterm of $\tau$ is a substring of $\tau$, which is also a term. Let $V = \{x, x_1, x_2, \ldots\}$ be a set of variables.
The set of terms over operators O and generators $G \cup V$ will be denoted by $\mathcal{T}^+(M)$. For terms $\tau_1,\ldots,\tau_n$
in $\mathcal{T}^+(M)$ we have a substitution $\varphi = \{(x_k \leftarrow \tau_k) \mid k = 1,\ldots,n\}$, which is a function from $\mathcal{T}^+(M)$ to
$\mathcal{T}^+(M)$. We use $\varphi(\tau)$ or $\tau[x_1/\tau_1,\ldots,x_n/\tau_n]$ for the result of replacing all occurrences of variables $x_k$ in
term $\tau$ by term $\tau_k$, $k = 1,\ldots,n$, where these changes are made simultaneously.
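
Because terms are strings in prefix notation, well-formedness and substitution can both be computed by a single left-to-right scan. The following sketch assumes that a term is encoded as a list of symbol names and that any symbol missing from the `arity` dictionary is a generator or a variable (arity 0); the function names are illustrative.

```python
def parse(symbols, arity, pos=0):
    """Return the position just past the term that starts at pos, or None
    if no term starts there."""
    if pos >= len(symbols):
        return None
    n = arity.get(symbols[pos], 0)  # generators and variables have arity 0
    pos += 1
    for _ in range(n):  # an operator of arity n is followed by n terms
        pos = parse(symbols, arity, pos)
        if pos is None:
            return None
    return pos


def is_term(symbols, arity):
    """A string is a term iff exactly one term spans the whole string."""
    return parse(symbols, arity) == len(symbols)


def substitute(symbols, phi):
    """Apply the substitution phi (variable -> term) simultaneously."""
    out = []
    for s in symbols:
        out.extend(phi.get(s, [s]))
    return out
```

Simultaneity matters: `substitute(["f", "x", "y"], {"x": ["y"], "y": ["x"]})` swaps the two variables instead of collapsing them.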

Definition 2.2: A binary relation ~ on $\mathcal{T}(M)$ or $\mathcal{T}^+(M)$ is a congruence provided that,

1) ~ is an equivalence relation,

2) if $\mathrm{ARITY}(\theta) = m$ and $\tau_k \sim \tau'_k$, $k = 1,\ldots,m$, then $\theta\tau_1 \ldots \tau_m \sim \theta\tau'_1 \ldots \tau'_m$.

An equation e is a string of the form $\tau = \tau'$ where $\tau, \tau'$ are in $\mathcal{T}^+(M)$. We use the symbol E for a set
of equations. We will be dealing with models for sets of equations, i.e., algebras. We consider each
equation e as a sentence of first-order predicate calculus (with equality), where all the variables from
V are universally quantified.

Definition 2.3: An algebra $\mathcal{A}$ is a pair (A, F), where A is a nonempty set and F is a set of functions.
Each f in F is a function from $A^n$ to A, for some n in $\mathcal{N}$ which we denote as type(f).

Example 2.1:

(a) A semigroup (A, {+}) is an algebra with one binary operator which is associative, i.e., for all x, y, z
in A we have (x + y) + z = x + (y + z). An example of a semigroup is the set of functions from $\mathcal{N}$ to $\mathcal{N}$,
together with the composition operation. In semigroups we use ab instead of a + b. We also omit
parentheses, without ambiguity.

(b) $\mathcal{A}_M$ is an algebra with $A = \mathcal{T}(M)$. For each $\theta$ in O we define a function in F with
$\mathrm{type}(\theta) = \mathrm{ARITY}(\theta)$; here we use the same symbol for the syntactic object $\theta$ and its interpretation.
The function $\theta$ maps terms $\tau_1,\ldots,\tau_m$ from $\mathcal{T}(M)$ to the term $\theta\tau_1 \ldots \tau_m$ (i.e., $\theta(\tau_1,\ldots,\tau_m) = \theta\tau_1 \ldots \tau_m$).
This algebra is referred to as the free algebra on M. From this example it is clear that we can without
ambiguity use both $\theta\tau_1 \ldots \tau_m$ and $\theta(\tau_1,\ldots,\tau_m)$ to denote the same term.

(c) Let ~ be a congruence on $\mathcal{T}(M)$. Condition (2) of Definition 2.2 guarantees that the operations
in O are well-defined on ~-equivalence (or congruence) classes. Thus we can form a quotient
algebra $\mathcal{T}(M)/\sim$ with domain $\{[\tau] \mid \tau$ in $\mathcal{T}(M)$, $[\tau]$ is the ~-congruence class of $\tau\}$ and with functions
corresponding to the operators in O.

(d) Observations similar to (b), (c) can be made for the set of terms $\mathcal{T}^+(M)$.

Implication: Let e be an equation and $\mathcal{A}$ an algebra. $\mathcal{A}$ satisfies e, or is a model for e, if e becomes
true when its operators and nonvariable generators are interpreted as the functions of $\mathcal{A}$ and its
variables take any values in the domain of $\mathcal{A}$. The class of all algebras which are models for a set of
equations E is called a variety or an equational class. We say that E implies e ($E \models e$) if the equation e
is true in every model of E.

Definition 2.4: An equational theory is a set of equalities E (of terms over $\mathcal{T}^+(M)$), closed under
implication.

See [41] for a survey of equational theories.

We write $E \vdash e$, if there exists a finite proof of e starting from E and using only the following five
rules:
$\tau = \tau$,
from $\tau_1 = \tau_2$ deduce $\tau_2 = \tau_1$,
from $\tau_1 = \tau_2$ and $\tau_2 = \tau_3$ deduce $\tau_1 = \tau_3$,
from $\tau_k = \tau'_k$, $k = 1,\ldots,m$, deduce $\theta\tau_1 \ldots \tau_m = \theta\tau'_1 \ldots \tau'_m$ ($\mathrm{ARITY}(\theta) = m$),
from $\tau_1 = \tau_2$ deduce $\varphi(\tau_1) = \varphi(\tau_2)$ ($\varphi$ is any substitution).

Proposition 2.2: [14, 41] $E \models \tau = \tau'$ iff $E \vdash \tau = \tau'$. ■

Proofs in the above system can also be viewed as reduction sequences, as follows [41]: Whenever
$E \models \tau = \tau'$, there is a sequence of terms $\tau_0,\ldots,\tau_m$ such that $\tau_0$ is $\tau$, $\tau_m$ is $\tau'$, and for $k = 0,\ldots,m-1$ the
term $\tau_{k+1}$ is obtained from $\tau_k$ by rewriting a subterm $\varphi(\sigma_1)$ as $\varphi(\sigma_2)$, where $\sigma_1 = \sigma_2$ (or $\sigma_2 = \sigma_1$) is an
equation in E and $\varphi$ is a substitution.

Let $\Gamma$ be a set of equations over terms in $\mathcal{T}(M)$ (i.e., containing no variables). Consider the
equational theory consisting of all equations $\tau = \tau'$ such that $\Gamma \models \tau = \tau'$. By Proposition 2.2 this theory
induces a congruence $=_\Gamma$ on $\mathcal{T}(M)$, where $\tau =_\Gamma \tau'$ iff $\Gamma \models \tau = \tau'$. From example (c) above we see that
this congruence naturally defines an algebra $\mathcal{T}(M)/{=_\Gamma}$. If $\Gamma$ is a finite set, $\mathcal{T}(M)/{=_\Gamma}$ is known as a
finitely presented algebra [47].



2.2 Functional and Inclusion Dependencies as Equations 

Let $\Sigma$ be a set of FD's and IND's over a database scheme D and $\sigma$ an FD or IND. We will
transform $\Sigma$ into two sets of equations $E_\Sigma$ and $\mathcal{E}_\Sigma$. We will show that $\Sigma \models \sigma$ iff $E_\Sigma \models E_\tau$ iff $\mathcal{E}_\Sigma \models \mathcal{E}_\tau$,
for some sets of equations $E_\tau$, $\mathcal{E}_\tau$ whose form depends on $\Sigma$ and $\sigma$. We assume that D only contains
one relation scheme. This simplifies notation, and there is no loss of generality.

Transformation: From the dependencies in $\Sigma$ construct the following sets of symbols:

$M_f = \{f_k \mid$ for each FD with n-attribute left-hand side include one operator $f_k$ of ARITY n$\}$,

$M_i = \{i_k \mid$ for each IND include one operator $i_k$ of ARITY 1$\}$,

$M_a = \{a_k \mid$ for each attribute $A_k$ include one operator $a_k$ of ARITY 1$\}$,

$M_\alpha = \{\alpha_k \mid$ for each attribute $A_k$ include one generator $\alpha_k\}$.

Now let $M = M_f \cup M_i \cup M_a \cup M_\alpha$ and $V = \{x, x_1, x_2, \ldots\}$ be a set of variables. $\mathcal{T}^+(M_f)$ ($\mathcal{T}^+(M_i)$) are the
sets of terms constructed using operators in $M_f$ ($M_i$) and generators in V.

The set $E_\Sigma$ consists of the following equations (presented in string notation):

1) one equation for each FD $A_1 \ldots A_n \to A$: $f_k a_1 x \ldots a_n x = a x$,

2) m equations for each IND $B_1 \ldots B_m \subseteq A_1 \ldots A_m$: $a_1 i_k x = b_1 x$ and ... and $a_m i_k x = b_m x$.

The set $\mathcal{E}_\Sigma$ consists of the following equations:

3) one equation for each FD $A_1 \ldots A_n \to A$: $f_k \alpha_1 \ldots \alpha_n = \alpha$,

4) m equations for each IND $B_1 \ldots B_m \subseteq A_1 \ldots A_m$: $i_k \alpha_1 = \beta_1$ and ... and $i_k \alpha_m = \beta_m$,

5) for each pair of symbols $f_p$ in $M_f$ and $i_q$ in $M_i$ the equation $f_p i_q x_1 \ldots i_q x_n = i_q f_p x_1 \ldots x_n$
($\mathrm{ARITY}(f_p) = n$).
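
The transformation is mechanical, and a short sketch may help to see the two equation sets side by side. The code below is illustrative only: it writes terms with parentheses instead of string notation, names the operators `f1, f2, ...` and `i1, i2, ...`, and uses the lower-cased attribute name both for the unary operator (written `a(x)`) and for the generator (written `a`). These conventions and the function name `build_equations` are assumptions of the example.

```python
def build_equations(fds, inds):
    """fds: [(lhs_attrs, rhs_attr)]; inds: [(lhs_attrs, rhs_attrs)] for lhs <= rhs.
    Returns two lists of (left, right) string pairs: the equations with unary
    attribute operators (items 1, 2) and those with generators (items 3, 4, 5)."""
    with_ops, with_gens = [], []
    f_ops, i_ops = [], []
    for k, (lhs, rhs) in enumerate(fds, start=1):
        f = f"f{k}"
        f_ops.append((f, len(lhs)))
        args_ops = ", ".join(f"{a.lower()}(x)" for a in lhs)
        args_gens = ", ".join(a.lower() for a in lhs)
        with_ops.append((f"{f}({args_ops})", f"{rhs.lower()}(x)"))   # item 1
        with_gens.append((f"{f}({args_gens})", rhs.lower()))         # item 3
    for k, (lhs, rhs) in enumerate(inds, start=1):
        i = f"i{k}"
        i_ops.append(i)
        for b, a in zip(lhs, rhs):
            with_ops.append((f"{a.lower()}({i}(x))", f"{b.lower()}(x)"))  # item 2
            with_gens.append((f"{i}({a.lower()})", b.lower()))            # item 4
    for f, n in f_ops:  # item 5: commutativity of every f with every i
        for i in i_ops:
            xs = [f"x{j}" for j in range(1, n + 1)]
            left = f"{f}(" + ", ".join(f"{i}({x})" for x in xs) + ")"
            right = f"{i}({f}(" + ", ".join(xs) + "))"
            with_gens.append((left, right))
    return with_ops, with_gens
```

For the FD A -> B together with the IND CD ⊆ AB, the second list contains `f1(a) = b`, `i1(a) = c`, `i1(b) = d` and the commutativity condition `f1(i1(x1)) = i1(f1(x1))`.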

Note that in $\mathcal{E}_\Sigma$ only equations (5) contain variables. Equations (5) are commutativity conditions
between the $f_k$'s and the $i_k$'s. We now present Theorem 2.1, which is central to our analysis.

Theorem 2.1: In each of the following three cases, (i), (ii), (iii) are equivalent.
= Case:
i) $\Sigma \models A = B$
ii) $E_\Sigma \models a x = b x$
iii) $\mathcal{E}_\Sigma \models \alpha = \beta$.

FD Case:
i) $\Sigma \models A_1 \ldots A_n \to A$
ii) $E_\Sigma \models \tau[x_1/a_1 x,\ldots,x_n/a_n x] = a x$, for some $\tau$ in $\mathcal{T}^+(M_f)$
iii) $\mathcal{E}_\Sigma \models \tau[x_1/\alpha_1,\ldots,x_n/\alpha_n] = \alpha$, for some $\tau$ in $\mathcal{T}^+(M_f)$.

IND Case:
i) $\Sigma \models B_1 \ldots B_m \subseteq A_1 \ldots A_m$
ii) $E_\Sigma \models a_1 \tau = b_1 x$ and ... and $a_m \tau = b_m x$, for some $\tau$ in $\mathcal{T}^+(M_i)$
iii) $\mathcal{E}_\Sigma \models \tau[x/\alpha_1] = \beta_1$ and ... and $\tau[x/\alpha_m] = \beta_m$, for some $\tau$ in $\mathcal{T}^+(M_i)$.

Proof: Observe that the = Case follows immediately from the IND Case, by writing A=B as
$AB \subseteq AA$. We use $E_\tau$ ($\mathcal{E}_\tau$) to denote the set of equations corresponding to term $\tau$ in (ii), (iii).

(ii)⇒(i):
Suppose $E_\Sigma \models E_\tau$, and let relation r satisfy $\Sigma$; we will show that r satisfies $\sigma$ ($\sigma$ is $A_1 \ldots A_n \to A$ in the
FD Case and $B_1 \ldots B_m \subseteq A_1 \ldots A_m$ in the IND Case). Relation r is, by definition, nonempty and its
entries can be assumed w.l.o.g. to be positive integers. Let the tuples of r be $t_1, t_2, \ldots$ (it could contain
a countably infinite number of tuples).

For each attribute A in $\mathcal{U}$, define a function $a(.): \mathcal{N} \to \mathcal{N}$ ($\mathcal{N}$ is the set of nonnegative integers) so that, if
$\nu$ is the index of a tuple in r, then $a(\nu)$ is the entry in tuple $t_\nu$ at attribute A; else $a(\nu)$ is 0.
For each FD $C_1 \ldots C_j \to C$ in $\Sigma$, define a function $f(\ldots): \mathcal{N}^j \to \mathcal{N}$ so that, if $a_k = t_\nu[C_k]$, $k = 1,\ldots,j$, then
$f(a_1,\ldots,a_j) = t_\nu[C]$; else $f(a_1,\ldots,a_j)$ is 0. This is a well-defined function, since r satisfies $C_1 \ldots C_j \to C$.
For each IND $D_1 \ldots D_j \subseteq C_1 \ldots C_j$ in $\Sigma$, define a function $i(.): \mathcal{N} \to \mathcal{N}$ so that, if $\nu$ is the index of a tuple in
r, then $i(\nu) = \nu'$, where $\nu'$ is the index of the first tuple in r where $t_\nu[D_1 \ldots D_j] = t_{\nu'}[C_1 \ldots C_j]$; else $i(\nu)$ is 0.
This is also a well-defined function, since r satisfies $D_1 \ldots D_j \subseteq C_1 \ldots C_j$.

We have constructed an algebra with domain $\mathcal{N}$ and functions $a(.),\ldots,f(\ldots),\ldots,i(.),\ldots$, which, as is easy to
verify, is a model for $E_\Sigma$. Let $\sigma$ be an IND. By interpreting each symbol in $\tau$ as an $i(.)$, we see that,
when $\nu$ is a tuple number, $\tau[x/\nu]$ is another tuple number. Since $E_\Sigma \models E_\tau$, we must have
$a_k(\tau[x/\nu]) = b_k(\nu)$, $k = 1,\ldots,m$, which means that r satisfies $\sigma$. The case of an FD is similar.

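(iii)⇒(ii):
Suppose $\mathcal{E}_\Sigma \models \mathcal{E}_\tau$, and let $\mathcal{M}$ be a model of $E_\Sigma$; we will show that $\mathcal{M}$ satisfies $E_\tau$. From $\mathcal{M}$ we
construct a model $\mathcal{A}(\mathcal{M})$ for $\mathcal{E}_\Sigma$. The domain of $\mathcal{A}(\mathcal{M})$ is the set of all functions from $\mathcal{M}$ to $\mathcal{M}$, i.e.,
$\mathcal{M} \to \mathcal{M}$.

In $\mathcal{A}(\mathcal{M})$ the interpretation of $\alpha$ is the function $a(x)$, which is the interpretation of $a(.)$ in $\mathcal{M}$. The
interpretation of $i(.)$ is the function $\lambda h.h(i(x))$, where $i(x)$ is the interpretation of $i(.)$ in $\mathcal{M}$. This is a
function from $\mathcal{M} \to \mathcal{M}$ to $\mathcal{M} \to \mathcal{M}$. The interpretation of $f(\ldots)$ is the function
$\lambda h_1 \ldots h_n.f(h_1(x),\ldots,h_n(x))$, where $f(x_1,\ldots,x_n)$ is the interpretation of $f(\ldots)$ in $\mathcal{M}$. This is a function from
$(\mathcal{M} \to \mathcal{M})^n$ to $\mathcal{M} \to \mathcal{M}$.

It is straightforward to check that equations (3), (4) hold in $\mathcal{A}(\mathcal{M})$, because $\mathcal{M}$ is a model for $E_\Sigma$.
Also equations (5) hold in $\mathcal{A}(\mathcal{M})$: For example, if n = 1 the interpretation of $f(i(h))$ in $\mathcal{A}(\mathcal{M})$ is
$f(h(i(x)))$, which is also the interpretation of $i(f(h))$ (h is any element of $\mathcal{M} \to \mathcal{M}$). Thus $\mathcal{A}(\mathcal{M})$ is a
model for $\mathcal{E}_\Sigma$. Since $\mathcal{E}_\Sigma \models \mathcal{E}_\tau$, $\mathcal{A}(\mathcal{M})$ satisfies $\mathcal{E}_\tau$. From this it follows that $\mathcal{M}$ satisfies $E_\tau$.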

(i)⇒(iii):
IND Case:

Consider a chase proof of $B_1 \ldots B_m \subseteq A_1 \ldots A_m$ from $\Sigma$. This chase starts from a single tuple $t_1$ and
generates tuples $t_2,\ldots,t_\nu$, where $t_\nu[A_1 \ldots A_m] = t_1[B_1 \ldots B_m]$. Now a tuple can only be generated by
applying an IND-rule on some previously generated tuple. Thus, we can assign (inductively) to each
tuple $t_p$, $p = 1,\ldots,\nu$, a term $\tau_p$ in $\mathcal{T}^+(M_i)$, as follows:

1. $\tau_1 = x$.

2. If $t_p$ was generated from $t_q$, $q < p$, by applying the IND-rule corresponding to some IND i in $\Sigma$, then
$\tau_p = \tau_q[x/ix]$.

The term $\tau_p$ records the sequence of applications of IND-rules which produced $t_p$ (starting from $t_1$).

We will show the following

Claim: For $1 \le p, q \le \nu$, C, D in $\mathcal{U}$, if $t_p[C] = t_q[D]$, then $\mathcal{E}_\Sigma \models \tau_p[x/\gamma] = \tau_q[x/\delta]$, where $\gamma, \delta$ are the
symbols in $M_\alpha$ corresponding to C, D.

Clearly, the IND Case follows from the Claim: Since $t_\nu[A_1 \ldots A_m] = t_1[B_1 \ldots B_m]$, we have
$\mathcal{E}_\Sigma \models \tau_\nu[x/\alpha_k] = \beta_k$, $k = 1,\ldots,m$.

Proof of Claim: Suppose the equality $t_p[C] = t_q[D]$ appears after exactly z steps of the chase. We
argue by induction on z.

Basis: z = 0. Then p = q = 1, C is D, and the conclusion is straightforward.

Induction Step: Let $t_p[C] = \kappa$, $t_q[D] = \lambda$. The symbols $\kappa, \lambda$ were equated by the chase. We
distinguish three cases, according to how this happened.

a. $\kappa$ is a freshly created symbol, identical to $\lambda$. This means $t_p$ was created from $t_{p'}$, $p' < p$, using an
IND $X_1 C' X_2 \subseteq Y_1 C Y_2$ in $\Sigma$ ($X_k, Y_k \subseteq \mathcal{U}$, k = 1,2), and $t_{p'}[C'] = t_q[D]$. By the induction hypothesis
$\mathcal{E}_\Sigma \models \tau_{p'}[x/\gamma'] = \tau_q[x/\delta]$. Now $\tau_p = \tau_{p'}[x/ix]$, where i is the operator corresponding to
$X_1 C' X_2 \subseteq Y_1 C Y_2$, and also $i\gamma = \gamma'$ is in $\mathcal{E}_\Sigma$. Thus, $\mathcal{E}_\Sigma \models \tau_{p'}[x/i\gamma] = \tau_q[x/\delta]$, i.e. $\mathcal{E}_\Sigma \models \tau_p[x/\gamma] = \tau_q[x/\delta]$.

b. $\kappa$ was equated to $\lambda$ in order to satisfy some FD $C_1 \ldots C_j \to C$ in $\Sigma$. This means
$t_p[C_1 \ldots C_j] = t_q[C_1 \ldots C_j]$, and D is C. By the induction hypothesis $\mathcal{E}_\Sigma \models \tau_p[x/\gamma_k] = \tau_q[x/\gamma_k]$, $k = 1,\ldots,j$.
Also, we have in $\mathcal{E}_\Sigma$ the equation $f\gamma_1 \ldots \gamma_j = \gamma$, where f is the operator in $M_f$ corresponding to the FD
$C_1 \ldots C_j \to C$. Thus, $\mathcal{E}_\Sigma$ implies $f\tau_p[x/\gamma_1] \ldots \tau_p[x/\gamma_j] = \tau_p[x/f\gamma_1 \ldots \gamma_j]$ (by the commutativity conditions
(5)) $= \tau_p[x/\gamma]$. Similarly $\mathcal{E}_\Sigma$ implies $f\tau_q[x/\gamma_1] \ldots \tau_q[x/\gamma_j] = \tau_q[x/f\gamma_1 \ldots \gamma_j] = \tau_q[x/\gamma]$, so
$\mathcal{E}_\Sigma \models \tau_p[x/\gamma] = \tau_q[x/\gamma]$.

c. There are tuples $t_{p'}, t_{q'}$, $p' \le p$, $q' \le q$, and C', D' in $\mathcal{U}$ such that $t_{p'}[C'] = \kappa$, $t_{q'}[D'] = \lambda$, and $t_{p'}[C']$
was equated to $t_{q'}[D']$ at some earlier step. Then by the induction hypothesis $\mathcal{E}_\Sigma$ implies
$\tau_p[x/\gamma] = \tau_{p'}[x/\gamma']$, $\tau_q[x/\delta] = \tau_{q'}[x/\delta']$, and $\tau_{p'}[x/\gamma'] = \tau_{q'}[x/\delta']$. Thus, $\mathcal{E}_\Sigma \models \tau_p[x/\gamma] = \tau_q[x/\delta]$.

FD Case:

Consider, as before, a chase proof of $A_1 \ldots A_n \to A$ from $\Sigma$. This chase starts from two tuples $t_1, t_2$ and
generates tuples $t_3,\ldots,t_\nu$; finally, $t_1[A] = t_2[A]$. Again a tuple can only be generated by applying an
IND-rule on some previously generated tuple, so we can assign (inductively) to each tuple $t_p$,
$p = 1,\ldots,\nu$, a term $\tau_p$ in $\mathcal{T}^+(M_i)$, as follows:

1. $\tau_1 = x_1$, $\tau_2 = x_2$.

2. If $t_p$ was generated from $t_q$, $q < p$, by applying the IND-rule corresponding to some IND i in $\Sigma$, then
$\tau_p = \tau_q[x_1/ix_1, x_2/ix_2]$.

Observe that $\tau_p$ also records the tuple ($t_1$ or $t_2$) which produced $t_p$ (apart from the sequence of
applications of IND-rules).
We will show the following

Claim: For $1 \le p, q \le \nu$, C, D in $\mathcal{U}$, if $t_p[C] = t_q[D]$, then $\mathcal{E}_\Sigma \models \tau_p[x_k/\gamma] = \tau_q[x_k/\delta]$ (k = 1,2). If,
additionally, $t_p$ is produced from $t_1$ and $t_q$ is produced from $t_2$, then $\mathcal{E}_\Sigma$ implies
$\tau_p[x_1/\gamma] = \tau_q[x_2/\delta] = \tau[x_1/\alpha_1,\ldots,x_n/\alpha_n]$, for some $\tau$ in $\mathcal{T}^+(M_f)$.

Clearly, the FD Case follows from the second part of the Claim: Since $t_1[A] = t_2[A]$,
$\mathcal{E}_\Sigma \models \alpha = \tau[x_1/\alpha_1,\ldots,x_n/\alpha_n]$, for some $\tau$ in $\mathcal{T}^+(M_f)$.

Proof of Claim: Suppose the equality $t_p[C] = t_q[D]$ appears after exactly z steps of the chase. We
argue by induction on z.

Basis: z = 0. Then p = q = 1, C and D are both some $A_k$, $1 \le k \le n$, and the conclusion is
straightforward.

Induction Step: Let $t_p[C] = \kappa$, $t_q[D] = \lambda$. The symbols $\kappa, \lambda$ were equated by the chase. We
distinguish three cases, according to how this happened.

a. $\kappa$ is a freshly created symbol, identical to $\lambda$. This means $t_p$ was created from $t_{p'}$, $p' < p$, using an
IND $X_1 C' X_2 \subseteq Y_1 C Y_2$ in $\Sigma$ ($X_k, Y_k \subseteq \mathcal{U}$, k = 1,2), and $t_{p'}[C'] = t_q[D]$. For the first part of the Claim,
we argue exactly as in the IND Case. For the second part, note that if $t_p$ is produced from $t_1$ then so is
$t_{p'}$. Therefore we can use the induction hypothesis on $t_{p'}, t_q$.

b. $\kappa$ was equated to $\lambda$ in order to satisfy some FD $C_1 \ldots C_j \to C$ in $\Sigma$. This means
$t_p[C_1 \ldots C_j] = t_q[C_1 \ldots C_j]$, and D is C. The argument for the first part proceeds exactly as in the IND
Case. For the second part, note that since $\mathcal{E}_\Sigma$ implies $\tau_p[x_1/\gamma_k] = \tau_k[x_1/\alpha_1,\ldots,x_n/\alpha_n]$, $k = 1,\ldots,j$
(by the induction hypothesis), we have that $\mathcal{E}_\Sigma$ implies
$\tau_p[x_1/\gamma] = \tau_p[x_1/f\gamma_1 \ldots \gamma_j] = f\tau_p[x_1/\gamma_1] \ldots \tau_p[x_1/\gamma_j] = f\tau_1[x_1/\alpha_1,\ldots,x_n/\alpha_n] \ldots \tau_j[x_1/\alpha_1,\ldots,x_n/\alpha_n]$
$= \tau[x_1/\alpha_1,\ldots,x_n/\alpha_n]$, where $\tau$ is $f\tau_1 \ldots \tau_j$. Similarly, $\mathcal{E}_\Sigma$ implies
$\tau_q[x_2/\gamma] = \tau[x_1/\alpha_1,\ldots,x_n/\alpha_n]$.

c. There are tuples $t_{p'}, t_{q'}$, $p' \le p$, $q' \le q$, and C', D' in $\mathcal{U}$ such that $t_{p'}[C'] = \kappa$, $t_{q'}[D'] = \lambda$, and $t_{p'}[C']$
was equated to $t_{q'}[D']$ at some earlier step. The argument for the first part proceeds exactly as in the
IND Case. For the second part, if $t_{p'}$ was produced from $t_2$, use the induction hypothesis on $t_p, t_{p'}$;
else, if $t_{q'}$ was produced from $t_2$, use the induction hypothesis on $t_{p'}, t_{q'}$; else, use the induction
hypothesis on $t_{q'}, t_q$.

This concludes the proof of (i)⇒(iii), so we are done. ■

We remark here that the (i)⇒(iii) direction can also be proved by showing that each of the rules
of [54] (see Subsection 2.1.1) can be simulated using the equational reasoning of Proposition 2.2. We
illustrate this simulation with an example:
From $A \to B$ and $CD \subseteq AB$ the pullback rule of [54] derives $C \to D$. In equational language $f\alpha = \beta$,
$i\alpha = \gamma$, $i\beta = \delta$ and $fix = ifx$ imply $f\gamma = fi\alpha = if\alpha = i\beta = \delta$.

Corollary 2.1: Let $\Sigma$ be a set of FD's and $\sigma$ an FD. The implication problem $\Sigma \models \sigma$ is equivalent
to a generator problem for a finitely presented algebra [47].

Proof: $\mathcal{E}_\Sigma$ is now a finite set of equations with no variables. If ~ is the congruence induced by $\mathcal{E}_\Sigma$
on $\mathcal{T}(M)$ then $\mathcal{T}(M)/\sim$ is a finitely presented algebra. The equational implication in Theorem 2.1 is
known, in this case, as a generator problem for the finitely presented algebra $\mathcal{T}(M)/\sim$. ■

Using Corollary 2.1, one can observe that the linear time algorithm of [6] for implication of FD's 
can be derived in a straightforward way from the algorithm of [47] for the generator problem. 

Corollary 2.2: Let $\Sigma$ be a set of FD's. The implication problem $\Sigma \models A = B$ is a uniform word
problem for a finitely presented algebra [47]. ■

If the given FD's are all unary, then the equational inferences in the theory $E_\Sigma$ can be thought of
as inferences in semigroups. This gives yet another transformation of (unary) FD's and IND's into
equations:

Semigroup Transformation: Let $\Sigma$ be a set of IND's and u-FD's. Construct a set of symbols $M_S$
from M as follows: for each $f_k(.)$ in $M_f$ add one generator $f_k$ in $M_S$; for each $i_k(.)$ in $M_i$ add one
generator $i_k$ in $M_S$; for each $a_k(.)$ in $M_a$ add one generator $a_k$ in $M_S$; add one binary operator + in $M_S$.

The set of equations $E_S$ consists of the associative axiom for + and the following word (string)
equations (we omit + and parentheses):

1) one equation for each u-FD $A_1 \to A$: $f_k a_1 = a$,

2) m equations for each IND $B_1 \ldots B_m \subseteq A_1 \ldots A_m$: $a_1 i_k = b_1$ and ... and $a_m i_k = b_m$.

Corollary 2.3: Let $\Sigma$ be a set of u-FD's and IND's:
$\Sigma \models A = B$ iff $E_S \models a = b$.
$\Sigma \models A_1 \to A$ iff $E_S \models w a_1 = a$, for some string w in $M_f^*$.
$\Sigma \models B_1 \ldots B_m \subseteq A_1 \ldots A_m$ iff $E_S \models a_1 w = b_1$ and ... and $a_m w = b_m$, for some string w in $M_i^*$. ■

Note that the first case is an instance of the uniform word problem for semigroups. The other two
cases are known as E-unification problems [41].

2.3 A Proof Procedure for FD's and IND's 

We will now describe a proof procedure for FD and IND implication, which exploits the special
structure of the equational theory $\mathcal{E}_\Sigma$ (Theorem 2.1). Whenever a dependency $\sigma$ cannot be proved
from a set of dependencies $\Sigma$, the procedure provides us (in a natural way) with an algebra which
satisfies $\mathcal{E}_\Sigma$ but violates any equation that could correspond to $\sigma$. Thus, by Theorem 2.1 we have that
$\Sigma$ does not imply $\sigma$, i.e. the procedure is complete for FD and IND implication.

The Proof Procedure G:

Given a set $\Sigma$ of FD's and IND's construct their graphical representation $G_\Sigma$ defined in Subsection
2.1.1. Each attribute name in $\Sigma$ is associated with one of the nodes of $G_\Sigma$.

Rules: Apply some finite sequence of the graph manipulation rules 1, 2, 3 and 4 of Figure 2-2 on $G_\Sigma$.
Rules 1 and 2 introduce new unnamed nodes. Rules 3 and 4 identify two existing nodes; the node
resulting from this identification is associated with the union of the two sets of attribute names that
were associated with each of the identified nodes. Note that rules 1, 2 w.l.o.g. need be applied at most
once to every left-hand side configuration.

Let G be the resulting graph. Associate a unique new name with every unnamed node in G.

We say that $\Sigma \vdash_G \sigma$ when:

$\sigma$ is A=B: A, B are associated with the same node.
$\sigma$ is an FD $A_1 \ldots A_n \to A$: The node associated with A gets marked by the following algorithm: We
mark the nodes associated with $A_1,\ldots,A_n$; whenever nodes $v_1,\ldots,v_j$ are marked and there is a group of
red arcs $(v_1,v),\ldots,(v_j,v)$ labeled by the name f of some FD in $\Sigma$, we mark v.
$\sigma$ is an IND $B_1 \ldots B_m \subseteq A_1 \ldots A_m$: For $k = 1,\ldots,m$ there is a black directed path from $A_k$ to $B_k$; moreover,
all these paths have the same sequence of labels.
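
The marking algorithm used in the FD case is a simple fixpoint computation. The sketch below is a naive version written for clarity: it rescans all groups until nothing changes, so it is quadratic in the worst case and is not the linear time algorithm of [6]. The encoding of each group of red arcs as a (source nodes, target node) pair and the name `marked_nodes` are assumptions of the example.

```python
def marked_nodes(start, red_groups):
    """start: nodes marked initially; red_groups: [(source_nodes, target_node)],
    one entry per group of red arcs. Returns the set of all marked nodes."""
    marked = set(start)
    changed = True
    while changed:  # repeat until no group can mark a new node
        changed = False
        for sources, target in red_groups:
            if target not in marked and all(v in marked for v in sources):
                marked.add(target)
                changed = True
    return marked
```

The FD is then provable in this sense exactly when the node associated with its right-hand side belongs to the returned set.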
Note that, as expected, the A=B Case is a specialization of the IND Case: if $\Sigma \vdash_G AB \subseteq AA$, then
A, B can be identified using Rule 3.

Theorem 2.2: $\Sigma \models \sigma$ iff $\Sigma \vdash_G \sigma$.

Proof:

(⇐): Rules 3, 4 are obviously sound. Rules 1 and 2 are sound in the sense of the attribute introduction
rule of [54] (see Subsection 2.1.1), which we illustrate as rule 5 of Figure 2-2.

(⇒): Let G be a (possibly infinite) graph obtained by closing $G_\Sigma$ under Rules 1-4. We will
construct from G a model $\mathcal{M}$ of $\mathcal{E}_\Sigma$.
The domain M of $\mathcal{M}$ is the set V of nodes of G, together with a special node $\bot$. The generator $\alpha_k$ is
interpreted as the node associated with $A_k$.
An operator i in $\mathcal{E}_\Sigma$ (corresponding to some IND in $\Sigma$) is interpreted as a function $i: M \to M$ as
follows: if v is in V and has an outgoing arc (v,w) labeled i, then $i(v) = w$; else $i(v) = \bot$. This function
is well-defined, because G is closed with respect to Rule 3.
An operator f of arity j in $\mathcal{E}_\Sigma$ (corresponding to some FD in $\Sigma$) is interpreted as a function $f: M^j \to M$
as follows: if $v_1,\ldots,v_j$ are in V and there is a group of red arcs $(v_1,v),\ldots,(v_j,v)$ labeled f, then
$f(v_1,\ldots,v_j) = v$; else $f(v_1,\ldots,v_j) = \bot$. This function is well-defined, because G is closed with respect to
Rule 4.
One can check that $\mathcal{M}$ satisfies the commutativity conditions (5) of $\mathcal{E}_\Sigma$ (because G is closed with
respect to Rules 1, 2) and $\mathcal{M}$ satisfies equations (3), (4) of $\mathcal{E}_\Sigma$ (because G was constructed starting from
$G_\Sigma$). Thus, $\mathcal{M}$ is a model of $\mathcal{E}_\Sigma$.

Now suppose we cannot prove $\sigma$ from $\Sigma$. If $\sigma$ is an FD $A_1 \ldots A_n \to A$, then clearly there is no $\tau$ in
$\mathcal{T}^+(M_f)$ such that $\tau[x_1/\alpha_1,\ldots,x_n/\alpha_n] = \alpha$ in $\mathcal{M}$. Thus, $\mathcal{M}$ is a counterexample to condition (iii) of
Theorem 2.1 and therefore $\Sigma$ does not imply $\sigma$. Similarly if $\sigma$ is an IND. ■



2.4 Computations as Inferences 

It has been known, since at least Post's proof of the unsolvability of the word problem for Thue
systems [55, 50], that arbitrary computations can be simulated by inferences in semigroups. Using
Corollary 2.3, we show that we can simulate computations by inferences of IND's and unary FD's.
We thus obtain lower bounds on the complexity of the implication problem for IND's and FD's.
We first describe our machine model: A deterministic two-stack machine M is a 5-tuple
$(Q, \Pi, q_{start}, h, \delta)$, where Q is a finite set of states, $\Pi$ is a finite set of symbols ($Q \cap \Pi = \emptyset$), $q_{start} \in Q$ is
the start state, $h \in Q$ is the halt state, and $\delta$ is the transition function. Each move of M falls into one of
the following two types:

1. $\delta(q,a) = (p, \mathrm{POP}_1)$: This means that, if M is in state q and $a \in \Pi$ is the top symbol of
$\mathrm{STACK}_1$, then on the next step M goes to state p and pops $\mathrm{STACK}_1$.

2. $\delta(q) = (p, \mathrm{PUSH}_1(\beta))$: If M is in state q, then on the next step M goes to state p and pushes
$\beta \in \Pi$ on $\mathrm{STACK}_1$.

Of course, analogous instructions can manipulate $\mathrm{STACK}_2$.

An instantaneous description (ID) of M is a string $x_1 \ldots x_n q y_m \ldots y_1$, where $q \in Q$, $x_i, y_j \in \Pi$: the string
$x_1 \ldots x_n$ is the contents of $\mathrm{STACK}_1$ (the top symbol is $x_n$); the string $y_1 \ldots y_m$ is the contents of $\mathrm{STACK}_2$
(the top symbol is $y_m$). The relation $w_1 \Rightarrow_M w_2$ (ID $w_1$ yields ID $w_2$ via one step of M) is defined in
the standard way [50, 40]. $\Rightarrow^*_M$ is the reflexive, transitive closure of $\Rightarrow_M$.

Let us now define a set S of word equations (over generators $Q \cup \Pi$) which capture the
computation of M:

1. If $\delta(q,a) = (p, \mathrm{POP}_1)$, then $aq = p$ is in S.
If $\delta(q,a) = (p, \mathrm{POP}_2)$, then $qa = p$ is in S.

2. If $\delta(q) = (p, \mathrm{PUSH}_1(\beta))$, then $q = \beta p$ is in S.
If $\delta(q) = (p, \mathrm{PUSH}_2(\beta))$, then $q = p\beta$ is in S.
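
The construction of S from the transition function can be written down in a few lines. The sketch below is illustrative: the encoding of the transition function as two dictionaries (`pops` for moves of type 1, `pushes` for moves of type 2), the representation of a word as a tuple of symbols, and the name `word_equations` are all assumptions made for the example.

```python
def word_equations(pops, pushes):
    """pops: {(q, a): (p, stack)} for pop moves; pushes: {q: (p, stack, beta)}
    for push moves, with stack equal to 1 or 2.
    Returns a list of (left_word, right_word) pairs, each word a tuple of symbols."""
    S = []
    for (q, a), (p, stack) in pops.items():
        # the first stack lies to the left of the state, the second to the right
        left = (a, q) if stack == 1 else (q, a)
        S.append((left, (p,)))
    for q, (p, stack, beta) in pushes.items():
        right = (beta, p) if stack == 1 else (p, beta)
        S.append(((q,), right))
    return S
```

Each pop move erases the symbol adjacent to the state on the side of the corresponding stack, and each push move inserts it there, which mirrors how a step of M changes an instantaneous description.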

We write $u =_S v$ iff $S \models u = v$. By a standard argument, based on the fact that M is deterministic
[55, 50], we have

Lemma 2.1: $q_{start} \Rightarrow^*_M h$ iff $q_{start} =_S h$. ■

To prove our first lower bound, we transform S into another set of equations T which looks like
the sets obtained (as in Corollary 2.3) from IND's and u-FD's. The set of generators is now
$Q \cup \{A_a, B_a, f_a \mid a \in \Pi\} \cup \{i_a \mid a \in \Pi\} \cup \{j_e \mid e \in S\}$.

1. If $qa = p$ is in S, then $q i_a = p$ is in T.

2. If $aq = p$ is in S, then T contains the equations $q = A_a j_e$, $f_a A_a = B_a$, $B_a j_e = p$, where e is
$aq = p$.
Lemma 2.2: $q_{start} =_S h$ iff $q_{start} =_T h$.

Proof: Given a word w over $Q \cup \Pi$ of the form $a_1 \ldots a_n q \beta_m \ldots \beta_1$, $q \in Q$, $a_i, \beta_j \in \Pi$, define a
corresponding word w' to be $f_{a_1} \ldots f_{a_n} q\, i_{\beta_m} \ldots i_{\beta_1}$. We claim that, if $w_1, w_2$ are words over $Q \cup \Pi$, then
$w_1 =_S w_2$ iff $w'_1 =_T w'_2$. The Lemma follows from this claim.

To prove the "only if" direction of the claim, consider the equations in S that can be used to
rewrite $w_1$ as $w_2$. If $qa = p$ is in S, then $q i_a =_T p$, since $q i_a = p$ is in T. If $aq = p$ is in S, then $f_a q =_T p$,
since $f_a q =_T f_a A_a j_e =_T B_a j_e =_T p$. The converse is also straightforward. ■

Theorem 2.3: The implication problem for IND's and two u-FD's is undecidable.

Proof: Given a deterministic two-stack machine M, it is undecidable if $q_{start} \Rightarrow^*_M h$, even if $|\Pi| = 2$
[53, 40]. By Lemmas 2.1 and 2.2, $q_{start} \Rightarrow^*_M h$ iff $q_{start} =_T h$. By Corollary 2.3, $q_{start} =_T h$ iff
$\Sigma \models Q_{start} = H$, where $\Sigma$ is the set of IND's and FD's which gives rise to T. But now observe that $\Sigma$
only contains FD's of the form $A_a \to B_a$, $a \in \Pi$. Since $|\Pi| = 2$, $\Sigma$ only contains two unary FD's. ■

Undecidability of the implication problem for IND's and FD's has already been proved [54, 19].
By way of comparison, these reductions use arbitrarily many IND's of the form D^CC^ and 
arbitrarily many u-FD's, while our reduction uses arbitrarily many IND's and only two u-FD's. 

To prove our second lower bound, we consider computations of a deterministic two-stack machine 
M where one of the two stacks has bounded size. Let us write w 1 =j>^ i w 2 iff ID w 2 follows from ID Wj 
by a computation of M during which STACK 2 contains at most s symbols. 

Let $S$ be the set of word equations described before: this time we transform $S$ into a set $T_s$ of
equations which can be obtained (as in Corollary 2.3) from acyclic IND's and u-FD's. The set of
generators now is $Q^0 \cup \ldots \cup Q^s \cup \{A_\alpha, B_\alpha, f_\alpha \mid \alpha \in \Pi\} \cup \{i_{\alpha k} \mid \alpha \in \Pi,\ k = 1,\ldots,s\} \cup \{j_{ek} \mid e \in S,\ k = 0,\ldots,s\}$,
where $Q^k = \{q^k \mid q \in Q\}$, $k = 0,\ldots,s$.

1. If $q\alpha = p$ is in $S$, then $q^{k+1} i_{\alpha,k+1} = p^k$ is in $T_s$, $k = 0,\ldots,s-1$.

2. If $\alpha q = p$ is in $S$, then $T_s$ contains the equations $q^k = A_\alpha j_{ek}$, $f_\alpha A_\alpha = B_\alpha$, $B_\alpha j_{ek} = p^k$,
$k = 0,\ldots,s$, where $e$ is $\alpha q = p$.

It is not hard to see that $T_s$ can be taken to represent a set $\Sigma_s$ of acyclic IND's and u-FD's: the
relation names are $R[A_\alpha B_\alpha \mid \alpha \in \Pi]$, $R_k[Q]$, $k = 0,\ldots,s$. It is also easy to see the following

Lemma 2.3: $q_{start} \Rightarrow_M^s h$ iff $q_{start}^0 =_{T_s} h^0$ iff $\Sigma_s \models R_0: Q_{start} \subseteq H$. ■

Theorem 2.4: There are constants $c_1, c_2 > 0$ such that the implication problem for acyclic IND's and
FD's can be solved in time $c_1^n$ but not in time $c_2^{n/\log n}$.

Proof: Since the IND's are acyclic, the chase gives us a decision procedure, running in exponential
time.

To prove the lower bound, let $L$ be any language in DTIME($c^n$), $c > 0$. We will show that $L$ is
polynomial-time reducible to the implication problem for acyclic IND's and u-FD's.
Let $M$ be a deterministic $n$-AuxiliaryPushdownAutomaton accepting $L$ [40]. Given string $x$, we
construct a deterministic two-stack machine $M_x$ which first puts $x$ on STACK 2 and then simulates
$M$. This simulation is done as follows: if $M$ is in state $q$, its auxiliary storage contains $a_1 \ldots a_n a w$ ($a$ is
the symbol scanned) and its stack contains $u\beta$ ($\beta$ is the top symbol), then the ID of $M_x$ is
$u \beta a_1 \ldots a_n q a w$. It is not hard to see how $M_x$ can simulate a move of $M$. Thus, $M$ accepts $x$ iff $M_x$
halts and STACK 2 always contains at most $|x|$ symbols, i.e. $x \in L$ iff $q_{start} \Rightarrow_{M_x}^{|x|} h$. Note also that the size
of $M_x$, $|M_x|$, is $O(|x|)$.

Now let $\Sigma_{|x|}$ be the set of acyclic IND's and u-FD's corresponding to $M_x$. Using Lemma 2.3, $x \in L$ iff
$\Sigma_{|x|} \models R_0: Q_{start} \subseteq H$. To complete the proof, observe that $\Sigma_{|x|}$ can be computed from $x$ in
polynomial time, and that the size of $\Sigma_{|x|}$ is $O(|M_x|\,|x| \log|x|)$, i.e. $O(|x| \log|x|)$. ■
Chapter Three
Application to Typed IND's 



In this Chapter we use the tools developed in Chapter 2 (Section 2.2) to study the particular
implication problem for FD's and typed IND's. We first present a proof procedure for general FD
and IND implication (Section 3.1), similar in spirit to the proof procedure of Theorem 2.2. By
specializing this proof procedure to typed IND's, we obtain as a corollary that the implication
problem for acyclic FD's and typed IND's is decidable (Section 3.2). In Section 3.3 we study the
special case of inferring FD's under pairwise consistency. By analyzing derivations (in the proof
procedure of Section 3.1), we show that the problem is undecidable. We also prove that there is no
k-ary axiomatization for implication of FD's under pairwise consistency. As a by-product of our
techniques, we obtain finite controllability of acyclic unary FD's under pairwise consistency.

3.1 Another Proof Procedure for FD's and IND's 

We present in this Section a proof procedure for general FD and IND implication. This procedure
is the main tool we use to study the implication problem for typed IND's and FD's. To prove
completeness of the procedure, we show that it captures (in an indirect way) equational inferences in
the theory $E_\Sigma$ of Theorem 2.1.

Let $\Sigma$ be a given set of FD's and IND's over a database scheme $D$, containing a single relation
scheme $R[U]$. We represent attribute $A_k$ by a node $a_k$. An FD $A_1 \ldots A_n \to A$ in $\Sigma$ is represented as
shown in Figure 3-1 by introducing a node $fa_1 \ldots a_n$ (we use a different function symbol $f$ for each
given FD), a group of directed arcs $(a_1, fa_1 \ldots a_n), \ldots, (a_n, fa_1 \ldots a_n)$ labeled $f$ and ordered from 1 to $n$, and
an undirected arc $\langle fa_1 \ldots a_n, a \rangle$. The undirected arc is the only modification to our graph notation of
Section 2.1.1. Its purpose is to represent the equation $fa_1x \ldots a_nx = ax$.
An IND $B_1 \ldots B_m \subseteq A_1 \ldots A_m$ in $\Sigma$ is represented (see Figure 3-1) by introducing directed arcs
$(a_1, b_1), \ldots, (a_m, b_m)$, labeled $i$ (we use a different label for each given IND).

Let $H_\Sigma$ be the mixed graph obtained from $\Sigma$ as described above. Repeatedly apply Rules
$T$ (transitivity), $E_{1-2}$ (equality), $I_{1-3}$ (introduction) (see Figure 3-2) on $H_\Sigma$, in some arbitrary fixed
order, until no more rules are applicable. As was the case with Rules 1,2 in Theorem 2.2, the
introduction rules need only be applied once for each left-hand side configuration.

Let $H = (N_H, A_H, E_H)$ be the mixed graph obtained this way ($N_H$ is a set of nodes, $A_H$ is a set of
labeled directed arcs on $N_H$, and $E_H$ is a set of undirected arcs on $N_H$). Notice that each node of $H$ is
labeled $Fu_1 \ldots u_q$, where $F$ is a term over the function symbols and $u_1, \ldots, u_q$ are nodes representing
attributes (by a slight abuse of notation, we write $Fu_1 \ldots u_q$ as a shorthand for $F[x_1/u_1, \ldots, x_q/u_q]$).
Moreover, every subterm of $Fu_1 \ldots u_q$ appears as a node of $H$.

By a path labeled $\tau$, where $\tau$ is a term over the $i$'s (and a variable $x$), we mean a mixed path where the
sequence of labels corresponds to $\tau$ (see Figure 3-1). In the special case where $\tau$ is simply $x$, the path
consists of undirected arcs.

The graph $H$ fully captures implication of FD's and IND's from $\Sigma$, as we now show:

Theorem 3.1:
FD Case:

$\Sigma \models A_1 \ldots A_n \to A$ iff there is a node $Fa_1 \ldots a_n$ of $H$ such that $\langle Fa_1 \ldots a_n, a \rangle \in E_H$.

IND Case:

$\Sigma \models B_1 \ldots B_m \subseteq A_1 \ldots A_m$ iff there is a path from $a_k$ to $b_k$ labeled $\tau$, $k = 1,\ldots,m$, where $\tau$ is a term over the
$i$'s.

Proof: Let $E_\Sigma$ be the set of equations of Theorem 2.1. Assume that the various names in $E_\Sigma$ are
consistent with the names in $H$.

($\Leftarrow$):

Claim:

(i) If $\langle Fu_1 \ldots u_p, Gv_1 \ldots v_q \rangle \in E_H$, where the $u_k$'s, $v_j$'s are nodes corresponding to attributes and $F, G$ are
terms over the $f$'s, then $E_\Sigma \models Fu_1x \ldots u_px = Gv_1x \ldots v_qx$.

(ii) If $(Fu_1 \ldots u_p, Gv_1 \ldots v_q)$ is a directed arc labeled $i$, then $E_\Sigma \models Fu_1ix \ldots u_pix = Gv_1x \ldots v_qx$.

Clearly, the "if" direction follows from the Claim, by Theorem 2.1.

Proof of Claim: We prove both (i) and (ii) by simultaneous induction on the number of
applications of rules that created an (undirected) arc of $H$.

Basis: No rules were applied. The conclusion is straightforward.

Induction Step: We have to check Rules $T$, $E_{1-2}$, $I_{1-3}$, each of which might have been applied at
the last step.

Rules $T$, $E_1$: Straightforward.

Rule $E_2$: The undirected arc $\langle Fu_1 \ldots u_p, Gv_1 \ldots v_q \rangle$ was created from the undirected arc
$\langle F'u'_1 \ldots u'_{p'}, G'v'_1 \ldots v'_{q'} \rangle$, where $(F'u'_1 \ldots u'_{p'}, Fu_1 \ldots u_p)$, $(G'v'_1 \ldots v'_{q'}, Gv_1 \ldots v_q)$ are directed arcs labeled $i$. By
the induction hypothesis, $E_\Sigma$ implies $F'u'_1x \ldots u'_{p'}x = G'v'_1x \ldots v'_{q'}x$, $F'u'_1ix \ldots u'_{p'}ix = Fu_1x \ldots u_px$,
$G'v'_1ix \ldots v'_{q'}ix = Gv_1x \ldots v_qx$. Thus, $E_\Sigma$ implies $Fu_1x \ldots u_px = Gv_1x \ldots v_qx$.

Rule $I_1$: The undirected arcs $\langle F_1u_1 \ldots u_p, G_1v_1 \ldots v_q \rangle, \ldots, \langle F_nu_1 \ldots u_p, G_nv_1 \ldots v_q \rangle$ create the undirected
arc $\langle Fu_1 \ldots u_p, Gv_1 \ldots v_q \rangle$, where $F = fF_1 \ldots F_n$, $G = fG_1 \ldots G_n$. By the induction hypothesis, $E_\Sigma$ implies
$F_ku_1x \ldots u_px = G_kv_1x \ldots v_qx$, $k = 1,\ldots,n$. Thus, $E_\Sigma$ implies
$Fu_1x \ldots u_px = fF_1u_1x \ldots u_px \ldots F_nu_1x \ldots u_px = fG_1v_1x \ldots v_qx \ldots G_nv_1x \ldots v_qx = Gv_1x \ldots v_qx$.

Rule $I_2$: The directed arcs $(F_1u_1 \ldots u_p, G_1v_1 \ldots v_q), \ldots, (F_nu_1 \ldots u_p, G_nv_1 \ldots v_q)$ (labeled $i$) create the
directed arc $(Fu_1 \ldots u_p, Gv_1 \ldots v_q)$ (labeled $i$), where $F = fF_1 \ldots F_n$, $G = fG_1 \ldots G_n$. By the induction
hypothesis, $E_\Sigma$ implies $F_ku_1ix \ldots u_pix = G_kv_1x \ldots v_qx$, $k = 1,\ldots,n$. Thus, $E_\Sigma$ implies
$Fu_1ix \ldots u_pix = fF_1u_1ix \ldots u_pix \ldots F_nu_1ix \ldots u_pix = fG_1v_1x \ldots v_qx \ldots G_nv_1x \ldots v_qx = Gv_1x \ldots v_qx$.

Rule $I_3$: Identical to Rule $I_2$.

($\Rightarrow$): Let $u$ be a node of $H$ labeled $Fu_1 \ldots u_p$, where the $u_k$'s are nodes corresponding to attributes.
We denote by $u\tau$ the term $Fu_1\tau \ldots u_p\tau$.

Claim: Suppose $E_\Sigma$ implies $Fu_1\tau \ldots u_p\tau = Gv_1\rho \ldots v_q\rho$, where the $u_k$'s, $v_j$'s correspond to arbitrary
nodes of $H$, $F, G$ are terms over the $f$'s, and $\tau, \rho$ are terms over the $i$'s (and a variable $x$). Also assume
$Fu_1 \ldots u_p$ is a node of $H$, and there are nodes $w_k$, $k = 1,\ldots,p$, such that there is a path from $u_k$ to $w_k$
labeled $\tau$. Then $Gv_1 \ldots v_q$ is a node of $H$ and there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

The "only if" direction follows easily from the Claim, by Theorem 2.1.

Proof of Claim: If $E_\Sigma \models \sigma = \sigma'$, then there is a sequence of terms $\sigma_0, \ldots, \sigma_m$ such that $\sigma_0$ is $\sigma$, $\sigma_m$ is
$\sigma'$, and for $k = 0,\ldots,m-1$ the term $\sigma_{k+1}$ is obtained from $\sigma_k$ by rewriting a subterm $\varphi(\theta_1)$ as $\varphi(\theta_2)$,
where $\theta_1 = \theta_2$ ($\theta_2 = \theta_1$) is an equation in $E_\Sigma$ and $\varphi$ is a substitution (Proposition 2.2). We call such a
sequence a proof of the equation $\sigma = \sigma'$.

We define a relation $\prec$ on pairs of terms as follows:

$(\xi, \xi') \prec (\eta, \eta')$ iff $E_\Sigma$ implies $\xi = \xi'$ and $\eta = \eta'$, and either
(i) the shortest proof of $\xi = \xi'$ is shorter than the shortest proof of $\eta = \eta'$, or
(ii) the above proofs have the same length, and $\xi$ is a proper subterm of $\eta$, $\xi'$ is a proper subterm of $\eta'$.

Obviously, $\prec$ is well-founded, so we can argue by induction on $\prec$. Let $\sigma_0, \ldots, \sigma_m$ be a shortest
proof of the equation $Fu_1\tau \ldots u_p\tau = Gv_1\rho \ldots v_q\rho$.

Basis: $m = 0$. Using $I_2$, $I_1$, we see by an easy induction on the structure of $F$ that there is a node
$Fw_1 \ldots w_p$ and a path from $Fu_1 \ldots u_p$ to $Fw_1 \ldots w_p$ labeled $\tau$ (see Figure 3-3).

Induction Step: We assume that the Claim holds for all equations $\xi = \xi'$ implied by $E_\Sigma$, where
$(\xi, \xi') \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$; we will show that it holds for the equation $Fu_1\tau \ldots u_p\tau = Gv_1\rho \ldots v_q\rho$.
We distinguish two cases:

Case 1: For $k = 0,\ldots,m-1$, $\sigma_{k+1}$ is obtained from $\sigma_k$ by rewriting a proper subterm. This means $F$ is
$fF_1 \ldots F_n$, $G$ is $fG_1 \ldots G_n$, and $F_su_1\tau \ldots u_p\tau$ is rewritten as $G_sv_1\rho \ldots v_q\rho$, $s = 1,\ldots,n$. Now for $s = 1,\ldots,n$,
$F_su_1 \ldots u_p$ is a node of $H$ and $(F_su_1\tau \ldots u_p\tau,\ G_sv_1\rho \ldots v_q\rho) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction
hypothesis $G_sv_1 \ldots v_q$ is a node of $H$ and there is a path from $G_sv_1 \ldots v_q$ to $F_sw_1 \ldots w_p$ labeled $\rho$ (see
Figure 3-4). Now by Rules $I_2$, $I_1$ and an easy induction on the structure of $F_s$, there is a path from
$F_su_1 \ldots u_p$ to $F_sw_1 \ldots w_p$ labeled $\tau$; then by Rules $I_2$, $I_1$ there is a node $fF_1w_1 \ldots w_p \ldots F_nw_1 \ldots w_p$, i.e. a
node labeled $Fw_1 \ldots w_p$. It follows by Rules $I_1$, $I_3$ that there is a node $fG_1v_1 \ldots v_q \ldots G_nv_1 \ldots v_q$, i.e. a node
$Gv_1 \ldots v_q$, and that there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

Case 2: For some $k$, $0 \le k \le m-1$, the whole term $\sigma_k$ is rewritten into $\sigma_{k+1}$. We distinguish four subcases:

Case 2a: $Fu_1\tau \ldots u_p\tau$ is rewritten as $fa_1\xi \ldots a_n\xi$, then as $a\xi$ using an equation $fa_1x \ldots a_nx = ax$ in $E_\Sigma$
and then as $Gv_1\rho \ldots v_q\rho$. Clearly $(Fu_1\tau \ldots u_p\tau,\ fa_1\xi \ldots a_n\xi) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the
induction hypothesis there is a path from $fa_1 \ldots a_n$ to $Fw_1 \ldots w_p$ labeled $\xi$ (see Figure 3-5). Since
$\langle fa_1 \ldots a_n, a \rangle \in E_H$, there is a path from $a$ to $Fw_1 \ldots w_p$ labeled $\xi$. We also have
$(a\xi,\ Gv_1\rho \ldots v_q\rho) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis $Gv_1 \ldots v_q$ is a node of $H$
and there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

Case 2b: $Fu_1\tau \ldots u_p\tau$ is rewritten as $a\xi$, then as $fa_1\xi \ldots a_n\xi$ using an equation $fa_1x \ldots a_nx = ax$ in $E_\Sigma$
and then as $Gv_1\rho \ldots v_q\rho$. Clearly $(Fu_1\tau \ldots u_p\tau,\ a\xi) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction
hypothesis there is a path from $a$ to $Fw_1 \ldots w_p$ labeled $\xi$ (see Figure 3-6). Since $\langle fa_1 \ldots a_n, a \rangle \in E_H$, there
is a path from $fa_1 \ldots a_n$ to $Fw_1 \ldots w_p$ labeled $\xi$. We also have
$(fa_1\xi \ldots a_n\xi,\ Gv_1\rho \ldots v_q\rho) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis $Gv_1 \ldots v_q$ is a node
of $H$ and there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

Case 2c: $Fu_1\tau \ldots u_p\tau$ is rewritten as $a\xi$, then as $bi\xi$ using an equation $ax = bix$ in $E_\Sigma$ and then as
$Gv_1\rho \ldots v_q\rho$. Clearly $(Fu_1\tau \ldots u_p\tau,\ a\xi) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis there
is a path from $a$ to $Fw_1 \ldots w_p$ labeled $\xi$ (see Figure 3-7). Since there is a directed arc $(b, a)$ labeled $i$,
there is a path from $b$ to $Fw_1 \ldots w_p$ labeled $i\xi$. We also have
$(bi\xi,\ Gv_1\rho \ldots v_q\rho) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis $Gv_1 \ldots v_q$ is a node of $H$
and there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

Case 2d: $Fu_1\tau \ldots u_p\tau$ is rewritten as $bi\xi$, then as $a\xi$ using an equation $ax = bix$ in $E_\Sigma$ and then as
$Gv_1\rho \ldots v_q\rho$. Clearly $(Fu_1\tau \ldots u_p\tau,\ bi\xi) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis there
is a path from $b$ to $Fw_1 \ldots w_p$ labeled $i\xi$ (see Figure 3-8). Now there is a node $c$ on this path such that
the subpath from $b$ to $c$ is labeled $i$. Since there is a directed arc $(b, a)$ labeled $i$, by Rules $E_1$, $E_2$, $T$ we
have $\langle a, c \rangle \in E_H$. Thus there is a path from $a$ to $Fw_1 \ldots w_p$ labeled $\xi$. We also have
$(a\xi,\ Gv_1\rho \ldots v_q\rho) \prec (Fu_1\tau \ldots u_p\tau,\ Gv_1\rho \ldots v_q\rho)$, so by the induction hypothesis $Gv_1 \ldots v_q$ is a node of $H$
and there is a path from $Gv_1 \ldots v_q$ to $Fw_1 \ldots w_p$ labeled $\rho$.

This concludes the Proof of the Claim, so we are done. ■

We remark here that Theorem 3.1 can be strengthened using the axiomatization of [54] for FD's
and IND's (see Subsection 2.1.1). Specifically, we can show that we need not use Rule $I_3$ in the
construction of $H$. To see this, consider the following sets of dependencies:
$F_H = \{u_1 \ldots u_p \to u \mid u_k,\ k = 1,\ldots,p$ and $u$ are nodes of $H$ such that $\langle Fu_1 \ldots u_p, u \rangle \in E_H\}$.
$I_H = \{u_1 \ldots u_q \subseteq v_1 \ldots v_q \mid u_k, v_k$ are nodes of $H$ such that there is a path from $v_k$ to $u_k$ labeled $\tau$, $k = 1,\ldots,q$,
where $\tau$ is a term over the $i$'s$\}$.

Here we assume that Rule $I_3$ was not used in the construction of $H$. Clearly $\Sigma \subseteq F_H \cup I_H$. Moreover, it
is straightforward (but lengthy) to verify that $F_H \cup I_H$ is closed under the rules of [54] (using the fact
that $H$ is closed under Rules $T$, $E_{1-2}$, $I_{1-2}$). Therefore, $\Sigma \models A_1 \ldots A_n \to A$ iff $a_1 \ldots a_n \to a$ is in $F_H$ and
$\Sigma \models B_1 \ldots B_m \subseteq A_1 \ldots A_m$ iff $b_1 \ldots b_m \subseteq a_1 \ldots a_m$ is in $I_H$. This stronger version, however, is not necessary for
our purposes.

3.2 Typed IND's and Acyclic FD's 

Suppose we are given a set $\Sigma$ of FD's and typed IND's, over database scheme $D = \{R_k[U_k] :
k = 1,\ldots,q\}$, $U_k \subseteq U$. An attribute $A_j$ of relation scheme $R_k$ is now represented by a node $a_j^k$ of $H_\Sigma$ (cf.
the graph notation of Section 2.1.1). The FD's and IND's in $\Sigma$ are represented in $H_\Sigma$ as explained at
the beginning of Section 3.1. We use a different label $i_{jk}$ for each typed IND
$R_k: A_1 \ldots A_m \subseteq R_j: A_1 \ldots A_m$ in $\Sigma$.

The fact that $\Sigma$ contains only typed IND's induces a special structure on the graph $H$ (of Theorem
3.1), which we will now analyze. Consider the graph $F_\Sigma$ of Section 2.1.1. This graph has a node $a$ for
each attribute $A$ in $U$ and a group of red arcs $(a_1, a), \ldots, (a_n, a)$ labeled $f$ for each group of red arcs
$(a_1^k, a^k), \ldots, (a_n^k, a^k)$ labeled $f$ of $H_\Sigma$. We define two partial functions type, node on the set of terms (over
the $a^k$'s and the $f$'s). If $\tau$ is a term, type($\tau$) is the name of a relation scheme in $D$ and node($\tau$) is a node
of $F_\Sigma$. The functions type, node are defined inductively as follows:

1. For each attribute $A$ of $R_k$, type($a^k$) $= R_k$, node($a^k$) $= a$.

2. If type($\tau_j$) $= R_k$ and node($\tau_j$) $= v_j$ for $j = 1,\ldots,n$, where there is a group of red arcs
$(v_1, v), \ldots, (v_n, v)$ labeled $f$ in $F_\Sigma$, then type($f\tau_1 \ldots \tau_n$) $= R_k$, node($f\tau_1 \ldots \tau_n$) $= v$.
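
The inductive definition of type and node can be read as a small recursive procedure. The following is a minimal sketch, not part of the original development: the term encoding (`("attr", A, k)` for a node $a^k$, `("f", name, [subterms])` for an application), the dictionary `schemes`, and the dictionary `red_arcs` are all hypothetical choices made only for illustration. The function returns `None` exactly where the partial functions are undefined.

```python
def type_and_node(term, schemes, red_arcs):
    """Return (type, node) for a term, or None where the partial functions are undefined.

    term     : ("attr", A, k) for the node a^k, or ("f", name, [subterms])
    schemes  : relation name k -> set of attribute names of R_k (hypothetical encoding)
    red_arcs : function symbol -> ([v_1, ..., v_n], v), one group of red arcs of the FD graph
    """
    if term[0] == "attr":
        _, attribute, relation = term
        # Clause 1: type(a^k) = R_k and node(a^k) = a, for A an attribute of R_k.
        if attribute in schemes.get(relation, set()):
            return (relation, attribute)
        return None
    _, symbol, subterms = term
    if symbol not in red_arcs:
        return None
    sources, target = red_arcs[symbol]
    results = [type_and_node(t, schemes, red_arcs) for t in subterms]
    if None in results:
        return None
    # Clause 2: all arguments share one type R_k and sit on the sources v_1, ..., v_n.
    types = {r[0] for r in results}
    nodes = [r[1] for r in results]
    if len(types) != 1 or nodes != list(sources):
        return None
    return (types.pop(), target)


schemes = {"R1": {"A", "B", "C"}, "R2": {"A", "B"}}
red_arcs = {"f": (["A", "B"], "C")}          # encodes an FD AB -> C
a1, b1, b2 = ("attr", "A", "R1"), ("attr", "B", "R1"), ("attr", "B", "R2")
print(type_and_node(("f", "f", [a1, b1]), schemes, red_arcs))
print(type_and_node(("f", "f", [a1, b2]), schemes, red_arcs))
```

The second call mixes arguments of two different types, so the functions are undefined on that term; this is the situation that typed IND's never produce in $H$.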

The crucial property of $H$ (in the case of typed IND's) is given in the following

Lemma 3.1: The functions type, node are defined on all terms that appear as labels of nodes of
$H$. Moreover,

1. If $f\tau_1 \ldots \tau_n$ is a node of $H$ then for $j = 1,\ldots,n$ we have type($\tau_j$) $= R_k$ and node($\tau_j$) $= v_j$, where there is a
group of red arcs $(v_1, v), \ldots, (v_n, v)$ labeled $f$ in $F_\Sigma$.

2. If $\langle u, v \rangle$ is an undirected arc of $H$ then type($u$) $=$ type($v$) and node($u$) $=$ node($v$).

3. If $(u, v)$ is a directed arc of $H$ labeled $i_{jk}$ then type($u$) $= R_j$, type($v$) $= R_k$ and node($u$) $=$ node($v$).

Proof: Straightforward simultaneous induction on the number of applications of rules that
produced a node (arc) of $H$. ■

Assume now that $F_\Sigma$ is acyclic: It is not hard to see that in this case each node of $F_\Sigma$ can be the
image (under node) of at most an exponential number of terms (in the size of $F_\Sigma$). Therefore by
Lemma 3.1 the size of $H$ is at most exponential, and by Theorem 3.1 we obtain

Corollary 3.1: The implication problem for acyclic FD's and typed IND's is decidable. ■

In particular, implication of an FD can be tested in exponential time, and implication of an IND
can be tested in nondeterministic exponential time (by guessing appropriate paths of $H$). Whether
these bounds can be improved is an open question.

We remark here that if $\Sigma$ is a set of FD's and typed IND's over database scheme $D$ and $\Sigma \models \sigma$,
where $\sigma$ is an IND, then $\sigma$ must be typed. This follows easily from Theorem 3.1 and Lemma 3.1, but
can also be seen directly as follows: Consider a database $d$ which associates to each relation scheme
$R_k$ of $D$ a single tuple $t_k$, where $t_k[A_j] = j$, $A_j \in U_k$. Clearly $d$ satisfies all FD's and all typed IND's (over
$D$), but violates any IND which is not typed.
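
The single-tuple database of this remark is easy to build and check mechanically. The sketch below is only an illustration under assumed encodings: the relation names, the attribute numbering in `index`, and the helper names are hypothetical and not taken from the text.

```python
def single_tuple_database(schemes, index):
    """One tuple t_k per relation scheme R_k, with t_k[A_j] = j."""
    return {name: [{a: index[a] for a in attrs}] for name, attrs in schemes.items()}


def satisfies_fd(db, rel, lhs, rhs):
    """Check the FD lhs -> rhs in relation rel."""
    seen = {}
    for t in db[rel]:
        key = tuple(t[a] for a in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False
    return True


def satisfies_ind(db, rel1, attrs1, rel2, attrs2):
    """Check the IND rel1[attrs1] contained in rel2[attrs2]."""
    target = {tuple(t[a] for a in attrs2) for t in db[rel2]}
    return all(tuple(t[a] for a in attrs1) in target for t in db[rel1])


schemes = {"R1": ["A", "B"], "R2": ["A", "B", "C"]}   # hypothetical scheme
index = {"A": 1, "B": 2, "C": 3}                      # t_k[A_j] = j
db = single_tuple_database(schemes, index)

print(satisfies_fd(db, "R2", ["A"], "C"))                      # any FD holds on one tuple
print(satisfies_ind(db, "R1", ["A", "B"], "R2", ["A", "B"]))   # a typed IND
print(satisfies_ind(db, "R1", ["A"], "R2", ["B"]))             # an IND that is not typed
```

Every FD holds trivially because each relation has one tuple, every typed IND holds because equal attributes carry equal values, and an IND that pairs different attributes compares different numbers and therefore fails.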



3.3 Inference of FD's under Pairwise Consistency 

Let $\Sigma$ be a set of FD's over database scheme $D$ and let PC($D$) be the set of all typed IND's over $D$
(recall that PC($D$) expresses the fact that the database is pairwise consistent). By the remark at the
end of the previous Section, PC($D$) $\cup\ \Sigma$ does not imply any new IND's, so we need only be concerned
with implication of FD's. Furthermore, observe that if a database $d$ over $D$ satisfies PC($D$), then
$R_k: A_1 \ldots A_n \to A$ holds in relation $R_k$ iff $R_j: A_1 \ldots A_n \to A$ holds in relation $R_j$, where $R_k[U_k]$, $R_j[U_j]$ both
contain attributes $A_1, \ldots, A_n, A$. For this reason we can suppress relation names from FD's.
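
Pairwise consistency amounts to saying that any two relations have the same projection on their common attributes, which is why an FD over shared attributes holds in one relation iff it holds in the other. A minimal sketch of this check follows; the representation of a relation as a pair (attribute tuple, set of rows) and the function names are assumptions made for illustration.

```python
def project(attrs, rows, onto):
    """Projection of a relation (attribute tuple plus set of rows) on the attributes `onto`."""
    positions = [attrs.index(a) for a in onto]
    return {tuple(row[i] for i in positions) for row in rows}


def pairwise_consistent(db):
    """True iff every two relations agree on the projection over their common attributes."""
    names = list(db)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            attrs1, rows1 = db[first]
            attrs2, rows2 = db[second]
            common = [a for a in attrs1 if a in attrs2]
            if project(attrs1, rows1, common) != project(attrs2, rows2, common):
                return False
    return True


def fd_holds(attrs, rows, lhs, rhs):
    """lhs -> rhs holds iff adding rhs does not split any lhs value."""
    return len(project(attrs, rows, lhs)) == len(project(attrs, rows, lhs + [rhs]))


db = {
    "R1": (("A", "B", "C"), {(1, 2, 0), (3, 4, 0)}),
    "R2": (("A", "B", "D"), {(1, 2, 9), (3, 4, 8)}),
}
print(pairwise_consistent(db))
print(fd_holds(*db["R1"], ["A"], "B"), fd_holds(*db["R2"], ["A"], "B"))
```

Since the FD test only looks at the projection on the attributes it mentions, two relations with equal projections necessarily give the same answer.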

In the presence of only typed IND's, every term that appears as label of a node of the graph $H$ (of
Theorem 3.1) is of the form $Fa_1^k \ldots a_p^k$, where type($Fa_1^k \ldots a_p^k$) $= R_k$; this is an easy consequence of Lemma
3.1. Now suppose we have pairwise consistency, there is a node labeled $Fa_1^k \ldots a_p^k$, and $A_m$ appears in
relation scheme $R_j$, $m = 1,\ldots,p$; then there is a directed arc labeled $i_{kj}$ from $a_m^k$ to $a_m^j$. Thus, by Rule $I_2$
(and an easy induction on the structure of $F$) there is a node labeled $Fa_1^j \ldots a_p^j$. This observation allows
us to represent the graph $H$ more succinctly, by having only one node $a_m$ for each attribute $A_m$ and a
node $Fa_1 \ldots a_p$ for each term $Fa_1^k \ldots a_p^k$ that appears as a label of a node of $H$.

This representation can be further simplified if the FD's in $\Sigma$ are all unary. In this case all we need
to observe is that the terms that appear as labels of nodes correspond to paths in the graph $F_\Sigma$ (recall
that $F_\Sigma$ is a directed graph with a node $a_m$ for each attribute $A_m$ and an arc $(a_k, a_j)$ for each FD
$A_k \to A_j$ in $\Sigma$). Moreover, it is not difficult to see that all such paths will appear as labels of nodes. We
now give the formal details of this representation.

Let $V$ be the set of nodes of $F_\Sigma$. For each attribute $A_m$, let $T_{A_m}$ be the following (possibly infinite)
directed tree:

the set of nodes $P_{A_m} \subseteq a_mV^*$ is the set of all paths in $F_\Sigma$ which start at $a_m$ (denoted as sequences of
nodes);
the set of arcs is $\{(sa_k, sa_ka_j) \mid s \in V^*,\ sa_k \in P_{A_m},\ A_k \to A_j \in \Sigma\}$.

Let $P = \bigcup_{A_m \in U} P_{A_m}$. Define $E$ to be the smallest set of undirected arcs on $P$ which contains $\langle s, s \rangle$
for all $s \in P$ and $\langle a_ka_j, a_j \rangle$ for all $A_k \to A_j$ in $\Sigma$, and is closed under the following rules:

1. Propagation: If $\langle sa_k, s'a_k \rangle \in E$, then $\langle sa_ka_j, s'a_ka_j \rangle \in E$ for all $A_k \to A_j$ in $\Sigma$.

2. Pseudo-Transitivity: If $\langle s_1, s_2 \rangle$, $\langle s_2, s_3 \rangle$ are in $E$, $s_k \in P_{A_{m_k}}$ ($k = 1,2,3$) and there is a relation scheme in
$D$ which contains $A_{m_1}, A_{m_2}, A_{m_3}$, then $\langle s_1, s_3 \rangle$ is in $E$.

By the preceding remarks and Theorem 3.1, we have

Lemma 3.2: PC($D$) $\cup\ \Sigma \models A_k \to A_j$ iff $\langle s, a_j \rangle \in E$ for some $s \in P_{A_k}$. ■

Example 3.1: Figure 3-9 has an example where $D = \{R_0[AQ_1Q_2B],\ R_1[AA_1Q_1],\ R_2[A_1Q_1A_2Q_2],\
R_3[A_2Q_2B]\}$ and $\Sigma$ is $\{A \to A_1,\ A_1 \to A_2,\ A_2 \to B,\ Q_1 \to A_2,\ Q_2 \to B\}$. In this case, PC($D$) $\cup\ \Sigma \models A \to B$.
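
For acyclic unary FD's the path sets are finite, so the set $E$ of Lemma 3.2 can be computed by a straightforward fixpoint iteration. The sketch below is an illustration only, not the author's procedure: paths are encoded as tuples of attribute names, arcs are stored in both orientations, and the scheme and FD set used in the usage lines are one reading of Example 3.1 (the scanned text is partly illegible, so that reading is an assumption).

```python
def paths_from(start, fds):
    """All paths of the FD graph that start at `start`, as tuples of attributes.
    The enumeration is finite only when the FD graph is acyclic."""
    found, stack = [], [(start,)]
    while stack:
        path = stack.pop()
        found.append(path)
        stack.extend(path + (y,) for (x, y) in fds if x == path[-1])
    return found


def arc_closure(schemes, fds):
    """Smallest symmetric set E closed under Propagation and Pseudo-Transitivity."""
    attributes = sorted(set().union(*schemes))
    paths = [p for a in attributes for p in paths_from(a, fds)]
    E = {(s, s) for s in paths}
    for (x, y) in fds:                       # initial arcs <a_k a_j, a_j>
        E |= {((x, y), (y,)), ((y,), (x, y))}
    while True:
        new = set()
        for (s, t) in E:
            # Propagation: both paths end at the same a_k; extend by an FD A_k -> A_j.
            if s[-1] == t[-1]:
                new |= {(s + (y,), t + (y,)) for (x, y) in fds if x == s[-1]}
        neighbours = {}
        for (s, t) in E:
            neighbours.setdefault(s, set()).add(t)
        for (s1, s2) in E:
            for s3 in neighbours[s2]:
                # Pseudo-Transitivity: the three start attributes share a relation scheme.
                if any({s1[0], s2[0], s3[0]} <= r for r in schemes):
                    new.add((s1, s3))
        if new <= E:
            return E
        E |= new


def implies(schemes, fds, lhs, rhs):
    """Lemma 3.2 test: is <s, a_j> in E for some path s starting at a_k?"""
    return any(s[0] == lhs and t == (rhs,) for (s, t) in arc_closure(schemes, fds))


# Assumed reading of Example 3.1.
schemes = [{"A", "Q1", "Q2", "B"}, {"A", "A1", "Q1"},
           {"A1", "Q1", "A2", "Q2"}, {"A2", "Q2", "B"}]
fds = [("A", "A1"), ("A1", "A2"), ("A2", "B"), ("Q1", "A2"), ("Q2", "B")]
print(implies(schemes, fds, "A", "B"))
print(implies(schemes, fds, "B", "A"))
```

Under this reading the arc between the path A A1 A2 B and the node B is reached through the auxiliary attributes Q1 and Q2, which act as bridges between relation schemes that do not share A, A1 and A2 directly.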

The "only if" direction of Lemma 3.2 can also be proved by a counterexample construction.
Suppose $\langle s, a_j \rangle$ is not in $E$, for any $s$ in $P_{A_k}$; we will construct a pairwise consistent database $d$ over $D$
which satisfies the FD's in $\Sigma$ but violates $A_k \to A_j$.

For each attribute $A_m$ in $U$ the domain of $A_m$, $\mathcal{D}_{A_m}$, consists of all functions $f: P_{A_m} \to \{0,1\}$ such that,
if $\langle s, s' \rangle \in E$, $s, s' \in P_{A_m}$, then $f(s) = f(s')$.

Let $U_n$ be $A_1 \ldots A_p$. We construct a relation $r_n$ over $R_n[U_n]$ as follows: A tuple $f_1 \ldots f_p$ ($f_\kappa \in \mathcal{D}_{A_\kappa}$) is in $r_n$
iff, for any $s$ in $P_{A_\kappa}$, $s'$ in $P_{A_\lambda}$ ($1 \le \kappa, \lambda \le p$) with $\langle s, s' \rangle \in E$, we have $f_\kappa(s) = f_\lambda(s')$.
It is easy to see that the database $d$ consisting of the relations $r_n$ satisfies the FD's in $\Sigma$ (by the
definition of the set $E$). We also claim that $d$ is pairwise consistent. The key observation is that, if
$A_{\kappa_1} \ldots A_{\kappa_q}$ is any subset of $U_n$, then the projection of $r_n$ on $A_{\kappa_1} \ldots A_{\kappa_q}$ consists of exactly those tuples
$f_{\kappa_1} \ldots f_{\kappa_q}$ for which $f_\eta(s) = f_\zeta(s')$ whenever $\langle s, s' \rangle \in E$ ($A_\eta, A_\zeta$ in $A_{\kappa_1} \ldots A_{\kappa_q}$). Finally, one can verify that if
$\langle s, a_j \rangle$ is not in $E$, for any $s$ in $P_{A_k}$, then $d$ violates $A_k \to A_j$.

The above construction produces in general an uncountable counterexample. Observe, however,
that if $\Sigma$ is acyclic then each $P_{A_m}$ is finite, so the counterexample is finite. It follows that for acyclic
unary FD's under pairwise consistency, finite implication coincides with (unrestricted) implication:

Theorem 3.2: The class of acyclic unary FD's under pairwise consistency is finitely controllable. ■

We now make some simple remarks about the set of undirected arcs $E$. Observe that, if $\langle s_1, s_2 \rangle \in E$
and $s_1s'$, $s_2s'$ are in $P$, then $\langle s_1s', s_2s' \rangle \in E$. This is an easy consequence of Propagation. Also, if
$\langle as_1, as_2 \rangle \in E$ and $sas_1$, $sas_2$ are in $P$, then $\langle sas_1, sas_2 \rangle \in E$. To see this, suppose $s$ is $s'b$, where $b$ is a
node such that $B \to A$ is in $\Sigma$. Then $\langle ba, a \rangle \in E$, so by Propagation $\langle bas_1, as_1 \rangle \in E$. Similarly
$\langle bas_2, as_2 \rangle \in E$. Then by Pseudo-Transitivity $\langle bas_1, bas_2 \rangle \in E$. We are now ready to prove the main
result of this Section.

Theorem 3.3: The implication problem for unary FD's in the presence of pairwise consistency is
undecidable.

Proof: We reduce the uniform word problem for semigroups (Thue systems [50]) to implication of
u-FD's under pairwise consistency. We assume that we are given a set $S$ of word equations of the
form $a_ia_j = a_k$; the problem is to determine whether $S \models a_1a_2 = a_3$. Recall that this happens iff the
string $a_3$ can be obtained from the string $a_1a_2$ by successively replacing a substring $w_1$ by a substring
$w_2$, where $w_1 = w_2$ ($w_2 = w_1$) is an equation in $S$.
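
The rewriting characterization recalled above suggests a search procedure. Since the word problem is undecidable, no search can be complete; the sketch below is a bounded breadth-first search that only explores words up to an assumed length limit `max_len`, so a `False` answer means "not found within the bound", not "not derivable". Symbols are encoded as single characters, and all names are hypothetical.

```python
from collections import deque


def one_step(word, equations):
    """All words obtained by replacing one occurrence of one side of an equation by the other."""
    for left, right in equations:
        for old, new in ((left, right), (right, left)):
            i = word.find(old)
            while i != -1:
                yield word[:i] + new + word[i + len(old):]
                i = word.find(old, i + 1)


def derivable(equations, source, target, max_len=8):
    """Bounded search: can `target` be reached from `source` through words of length <= max_len?"""
    seen, queue = {source}, deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            return True
        for nxt in one_step(word, equations):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


equations = [("ab", "c"), ("cd", "e")]      # equations of the form a_i a_j = a_k
print(derivable(equations, "abd", "e"))     # abd -> cd -> e
print(derivable(equations, "ab", "e"))      # only ab and c are reachable
```

Raising `max_len` enlarges the explored space but can never turn the procedure into a decision procedure, which is consistent with the undecidability result being proved here.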

For each given equation in $S$, say $a_ia_j = a_k$, we include in our database scheme relation
schemes R^, K^, Rj. 3 , L, M^, as shown in Figure 3-LO. The directed arcs represent unary FD's. 
There are two general-purpose attributes $X, Y$. For each $a_m$ there are two attributes $A_m, B_m$, and for
each equation there is a set of attributes $Q_{1-8}$.

If the equation to be inferred is $a_1a_2 = a_3$, then we include in the database scheme relation
schemes Rj. 7 , K^, R{_ 3 , L, J 2 . 3 and FD's as in Figure 3-10 (where now A^Bj are A x , \\, A:,Bj are 
A 2 ,B 2 , A k ,B k are A 3 ,B 3 , and we have used attributes 0[. 8 ). We will show that the u-FD Q'(,-*Q is 
implied iff $S \models a_1a_2 = a_3$. Let $P$ be a set of nodes and $E$ a set of undirected arcs as in Lemma 3.2.

Claim: The undirected arc $\langle xa_1b_1yxa_2b_2y,\ xa_3b_3y \rangle$ is in $E$ iff $S \models a_1a_2 = a_3$.

Proof of Claim: We will give a characterization of the set $E$. Let $e$ be an equation $a_ia_j = a_k$ in $S$,
and suppose c gives rise to relation schemes Rj. 7 , K]. 2 , Ri-3, L, M x _ 2 , as in Figure 3-10. Consider the 
following sets of undirected arcs which correspond to $e$ (all these arcs are in $E$):

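$E_1^e$:
$\langle xa_i, a_i \rangle$,
$\langle a_ib_i, b_i \rangle$, $\langle q_1b_i, b_i \rangle$, $\langle a_ib_i, q_1b_i \rangle$,
$\langle b_iy, y \rangle$, $\langle q_2y, y \rangle$, $\langle b_iy, q_2y \rangle$,
$\langle yx, x \rangle$, $\langle q_3x, x \rangle$, $\langle yx, q_3x \rangle$,
$\langle xa_j, a_j \rangle$, $\langle q_4a_j, a_j \rangle$, $\langle xa_j, q_4a_j \rangle$,
$\langle a_jb_j, b_j \rangle$, $\langle q_5b_j, b_j \rangle$, $\langle a_jb_j, q_5b_j \rangle$,
$\langle b_jy, y \rangle$, $\langle q_6y, y \rangle$, $\langle b_jy, q_6y \rangle$,
$\langle xa_k, a_k \rangle$,
$\langle a_kb_k, b_k \rangle$, $\langle q_7b_k, b_k \rangle$, $\langle a_kb_k, q_7b_k \rangle$,
$\langle b_ky, y \rangle$, $\langle q_8y, y \rangle$, $\langle b_ky, q_8y \rangle$.

$E_2^e$:
$\langle xa_ib_i, q_1b_i \rangle$,
$\langle a_ib_iy, q_2y \rangle$, $\langle q_1b_iy, q_2y \rangle$,
$\langle b_iyx, q_3x \rangle$, $\langle q_2yx, q_3x \rangle$,
$\langle yxa_j, q_4a_j \rangle$, $\langle q_3xa_j, q_4a_j \rangle$,
$\langle xa_jb_j, q_5b_j \rangle$, $\langle q_4a_jb_j, q_5b_j \rangle$,
$\langle a_jb_jy, q_6y \rangle$, $\langle q_5b_jy, q_6y \rangle$,
$\langle xa_kb_k, q_7b_k \rangle$,
$\langle a_kb_ky, q_8y \rangle$, $\langle q_7b_ky, q_8y \rangle$.

$E_3^e$:
$\langle q_1b_iyx, q_3x \rangle$, $\langle q_2yxa_j, q_4a_j \rangle$, $\langle q_3xa_jb_j, q_5b_j \rangle$, $\langle q_4a_jb_jy, q_6y \rangle$,
$\langle xa_kb_ky, q_8y \rangle$.

$E_4^e$:
$\langle q_1b_iyxa_j, q_4a_j \rangle$, $\langle q_2yxa_jb_j, q_5b_j \rangle$, $\langle q_3xa_jb_jy, q_6y \rangle$.

$E_5^e$:
$\langle q_1b_iyxa_jb_j, q_5b_j \rangle$, $\langle q_2yxa_jb_jy, q_6y \rangle$.

$E_6^e$:
$\langle xa_ib_iyxa_jb_jy, q_6y \rangle$.

$E_7^e$:
$\langle q_6y, q_8y \rangle$, $\langle xa_kb_ky, q_6y \rangle$, $\langle xa_ib_iyxa_jb_jy, q_8y \rangle$,
$\langle xa_ib_iyxa_jb_jy, xa_kb_ky \rangle$.

It is not difficult to see that for each equation $e$ in $S$, $k = 1,\ldots,7$, $E_k^e$ is contained in $E$ (compare with
Figure 3-9).

Now consider the following set of arcs $E'$: Let $\langle s_1, s_2 \rangle$ be a member of some $E_k^e$ (for some $e, k$), and
suppose $s'$ is obtained from $s$ by successively replacing a subsequence $xa_ib_iyxa_jb_jy$ by a subsequence
$xa_kb_ky$ (or vice versa), where $a_ia_j = a_k$ is in $S$. If $s_1s$, $s_2s'$ are in $P$, then put $\langle s_1s, s_2s' \rangle$ in $E'$. Also if $s$, $s'$
are in $P$, then put $\langle s, s' \rangle$ in $E'$.

By the remarks immediately preceding the statement of Theorem 3.3 (and the fact that $E_k^e \subseteq E$) we
have $E' \subseteq E$. Furthermore $E'$ contains the arcs initially put in $E$, and clearly it is closed under
Propagation. It is also straightforward (albeit a bit tedious) to verify that $E'$ is closed under
Pseudo-Transitivity. Therefore $E \subseteq E'$, and thus $E = E'$. The Claim now follows from this characterization of
$E$.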

To finish the Proof, observe that Q' 6 — +Q is implied (Lemma 3.2) iff <xa 1 b 1 yxa 2 b 2 y, xa 3 b 3 y> is in E 
(cf. Figure 3-10). I 

We will now show that there is no k-ary axiomatization for implication of u-FD's in the presence 
of pairwise consistency. 

Let $D$ be a database scheme and $\Theta$ a set of sentences about $D$ (for instance, FD's and IND's). An
axiom system for implication of sentences in $\Theta$ is k-ary [16] iff it is universe-bounded (i.e. only
attributes in $D$ are mentioned) and every rule has at most $k$ antecedents, for some fixed integer
$k$. Observe that the axiom system of [54] for implication of FD's and IND's is not k-ary, because Rule
10 violates the boundedness condition (see Subsection 2.1.1).

Let $\Sigma \subseteq \Theta$, $\sigma$ in $\Theta$. We say that $\Sigma$ is closed under implication iff whenever $\Sigma \models \sigma$ we have $\sigma \in \Sigma$. Also,
$\Sigma$ is closed under k-ary implication iff whenever $\Sigma' \models \sigma$, where $\Sigma' \subseteq \Sigma$, $|\Sigma'| \le k$, we have $\sigma \in \Sigma$. The
following characterization for the existence of k-ary axiomatizations is taken from [16]:

Proposition 3.1: There is a k-ary axiomatization for implication of sentences in $\Theta$ iff whenever
$\Sigma \subseteq \Theta$ is closed under k-ary implication, $\Sigma$ is closed under implication. ■

Theorem 3.4: There is no k-ary axiomatization for implication of u-FD's under pairwise
consistency (we consider here axiomatizations involving arbitrary FD's and IND's).

Proof: Let $U$ be $\{A, A_1, \ldots, A_k, Q_1, \ldots, Q_k, B\}$ and let $D$ be a database scheme over $U$ consisting of
relation schemes $R_0[AQ_1 \ldots Q_kB]$, $R_1[AA_1Q_1]$, $R_j[A_{j-1}Q_{j-1}A_jQ_j]$, $j = 2,\ldots,k$, $R_{k+1}[A_kQ_kB]$. Let $\Phi$ be the
following set of FD's over $D$: $R_1: A \to A_1$, $R_j: A_{j-1} \to A_j$, $j = 2,\ldots,k$, $R_j: Q_{j-1} \to A_j$, $j = 2,\ldots,k$,
$R_{k+1}: A_k \to B$, $R_{k+1}: Q_k \to B$, $R_0: Q_j \to B$, $j = 1,\ldots,k$ (cf. Figure 3-9 for the case $k = 2$).

Consider the set $\Phi'$ of FD's which are consequences of $\Phi$. The set $\Phi'$ can be constructed by
closing $\Phi$ under Rules 1,2,3 of the axiom system of [54] (see Subsection 2.1.1). Let $\Sigma$ be $\Phi' \cup$ PC($D$).
We will show that $\Sigma$ is not closed under implication (of FD's and IND's), but is closed under k-ary
implication (of FD's and IND's). Theorem 3.4 will then follow by Proposition 3.1.

For the first part, it is not hard to see that $\Sigma \models \sigma$, where $\sigma$ is $R_0: A \to B$ (cf. Figure 3-9). Since $\sigma$ is
not in $\Sigma$, we are done.

For the second part, suppose $\Sigma' \models \sigma$, where $\Sigma' \subseteq \Sigma$, $|\Sigma'| \le k$, $\sigma$ is an IND or an FD. We will show
that $\sigma$ is in $\Sigma$.

If $\sigma$ is an IND, then $\sigma$ must be typed, by the remark at the end of Section 3.2. Thus $\sigma$ is in $\Sigma$.

Suppose now $\sigma$ is an FD $R_p: C_1 \ldots C_q \to C_0$, where $0 \le p \le k+1$ and all the $C_j$'s are in $U$. Since all
the FD's in $\Phi$ are unary, it easily follows from Theorem 3.1 that $\Sigma' \models R_p: C_m \to C_0$, for some $m$,
$1 \le m \le q$. We will argue that $R_p: C_m \to C_0$ is in $\Phi$; from this it easily follows that $\sigma$ is in $\Phi'$, i.e. it is in
$\Sigma$.

Consider the nodes $c_m$, $c_0$ of the graph $F_\Sigma$ (cf. Figure 3-9). If there is no directed path from $c_m$ to
$c_0$, then we can construct a relation $r$ over $U$ which satisfies all the FD's in $\Phi$ (without their relation
names) but violates $C_m \to C_0$. We can then project $r$ over the $R_j$'s to obtain a database $d$ over $D$ which
satisfies $\Sigma$ (and thus also $\Sigma'$) and violates $R_p: C_m \to C_0$.

Thus, there is a directed path from $c_m$ to $c_0$. Since $C_m$, $C_0$ also appear under the same relation name, it is
easy to check that $R_p: C_m \to C_0$ is in $\Phi$, unless $R_p: C_m \to C_0$ is $R_0: A \to B$. However, since $|\Sigma'| \le k$, one of
the FD's $R_1: A \to A_1$, $R_j: A_{j-1} \to A_j$, $j = 2,\ldots,k$, $R_{k+1}: A_k \to B$ must be missing from $\Sigma'$ and therefore we
cannot have $\Sigma' \models R_0: A \to B$ (since there is no directed path from $a$ to $b$ in $F_{\Sigma'}$). This concludes the
proof. ■
Figure 3-2: Graph rules for FD's and IND's
Figure 3-9: Example of FD inference under pairwise consistency
Figure 3-10: Gadgets for Proof of Theorem 3.3
Chapter Four
Finite Implication of FD's and Unary IND's 



A natural question is whether our equational approach can handle finite implication of database
constraints. Ideally, we would like to be able to replace $\models$ by $\models_{fin}$ throughout Theorem 2.1. It is
easily seen that the same arguments can show that (iii)$\Rightarrow$(ii) and (ii)$\Rightarrow$(i) in the finite case (the
constructions given map finite counterexamples to finite counterexamples). The argument for
(i)$\Rightarrow$(iii), however, breaks down, because it is based on the existence of a complete proof procedure
for implication (namely the chase) and such a proof procedure cannot exist for finite implication
[54, 19]. As a matter of fact, the same syntactic nature of the proofs of Theorems 2.3 and 3.3 prevents
us from proving undecidability of finite implication. The weaker proofs of [54, 19], because of their
semantic nature, can easily be done for the finite case.

However, Theorem 2.4 also holds for the finite case: By the discussion above one can see that $\models$
can be replaced by $\models_{fin}$ in Theorem 2.1 if we have a finitely controllable class of FD's and IND's, i.e.
a class where $\models_{fin}$ is the same as $\models$. Acyclic IND's and FD's provide an easy example of such a
class, because the chase in this case constructs a finite counterexample if the implication does not
hold. Another example of a finitely controllable class is acyclic unary FD's under pairwise
consistency (Theorem 3.2).

If $\models_{fin}$ is different from $\models$, we might still be able to handle the finite case if there is a complete
proof procedure for finite implication. In this Chapter we provide such a class: we show that there is
a complete proof procedure for finite implication of FD's and unary IND's. This proof procedure is
then used to prove a (weaker) analogue of Theorem 2.1 for finite implication of FD's and u-ID's.

Let $\Sigma$ be a set of FD's and u-ID's over a database scheme $D$ containing a single relation scheme
$R[U]$. If $\sigma$ is an FD or u-ID, we will show that $\Sigma \models_{fin} \sigma$ iff $\sigma$ can be proved from $\Sigma$ using the
following set of rules (*). We use $X, Y$ to denote sets of attributes. We denote a u-ID $A \subseteq B$
alternatively as $B \supseteq A$.






Rules (*):

1. (reflexivity) A→A, A∈U.

2. (augmentation) from X→A derive XY→A, A∈U.

3. (transitivity) from X→A_k, k = 1,...,n, A_1...A_n→A, derive X→A, A∈U.

4. (u-ID reflexivity) A⊆A, A∈U.

5. (u-ID transitivity) from A⊆B and B⊆C derive A⊆C, A,B,C∈U.

6. (cycle rules) For every odd positive integer m and attributes A_k,
from A_0→A_1 and A_1⊇A_2 and ... and A_{m-1}→A_m and A_m⊇A_0
derive A_1→A_0 and A_2⊇A_1 and ... and A_m→A_{m-1} and A_0⊇A_m.

Rules 1,2,3 are the standard rules for FD's [5] (written in our notation) and Rules 4,5 are the
specialization of the general IND rules of [16] to u-ID's. Thus, Rules 1-5 are sound for general
databases (infinite as well as finite). A simple counterexample construction shows that Rules 1-5 are
also complete for unrestricted implication of FD's and u-ID's. More specifically, FD's and u-ID's
decouple in the case of unrestricted implication.

Proposition 4.1: Let Σ_F be a set of FD's and Σ_I a set of u-ID's.

1. Σ_F ∪ Σ_I ⊨ X→A iff Σ_F ⊨ X→A.

2. Σ_F ∪ Σ_I ⊨ A⊆B iff Σ_I ⊨ A⊆B.

Proof: The "if" direction is obvious in both cases. We will show the "only if" direction.

1. Suppose Σ_F does not imply X→A. Let X+ = {B | B∈U, Σ_F ⊨ X→B}. Consider a relation r
consisting of tuples t_k, k = 0,1,2,..., where t_0[B] = 0, B∈U, and for k = 1,2,..., t_k[B] = k-1 if B∈X+ and
t_k[B] = k otherwise. It is easy to see that r satisfies the FD's in Σ_F (the only tuples to check are
t_0, t_1, since no other pair of tuples agrees on any attribute), and obviously r satisfies all u-ID's.
Now since A is not in X+, r violates X→A. Therefore, Σ_F ∪ Σ_I does not imply X→A.

2. Suppose Σ_I does not imply A⊆B. Let G_I be a directed graph which has a node a_m for each
attribute A_m in U and a directed arc (a_j,a_k) for each u-ID A_k⊆A_j in Σ_I. By our assumption, there is
no directed path from b to a in G_I (cf. Rules 4,5). Thus, we can assign to each node u of G_I a number
c(u) so that c(u)≤c(v) whenever there is a directed path from u to v, and c(b)>c(a) (this can be done
by a topological sort of the dag of strongly connected components of G_I [2]). Now consider a relation
r consisting of tuples t_k, k = 0,1,2,..., where for A_m in U we have t_k[A_m] = k + c(a_m). Clearly r satisfies
all u-ID's in Σ_I and violates A⊆B. Moreover, r satisfies all FD's, so Σ_F ∪ Σ_I does not imply A⊆B. ∎
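
Proposition 4.1 suggests that unrestricted implication can be tested separately for the two kinds of dependencies. The following is a minimal Python sketch of that idea, not a procedure given in the text: FD's are tested with the usual attribute-closure computation of X+, and u-ID's by following chains A⊆...⊆B (Rules 4,5). The function names and the encoding of dependencies as pairs are assumptions made for illustration.

```python
def fd_closure(attrs, fds):
    """Return X+, the attributes determined by `attrs`.
    `fds` is a list of (lhs_set, rhs_attribute) pairs (assumed encoding)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if rhs not in closure and set(lhs) <= closure:
                closure.add(rhs)
                changed = True
    return closure


def implies_fd(fds, lhs, rhs):
    """Unrestricted implication of the FD lhs -> rhs; u-ID's are irrelevant here."""
    return rhs in fd_closure(lhs, fds)


def implies_uid(uids, sub, sup):
    """Unrestricted implication of the u-ID sub <= sup.
    `uids` is a list of (a, b) pairs meaning a <= b; FD's are irrelevant here."""
    reached, stack = {sub}, [sub]
    while stack:
        cur = stack.pop()
        for a, b in uids:
            if a == cur and b not in reached:
                reached.add(b)
                stack.append(b)
    return sup in reached
```

Note that this sketch covers only unrestricted implication; for finite implication the two kinds of dependencies interact through the cycle rules.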

As a matter of fact, the cycle rules are not sound for infinite databases: Consider a relation r over
relation scheme R[AB], consisting of tuples t_k, k = 0,1,2,..., where t_k[A] = k, t_k[B] = k+1: clearly r
satisfies B→A, A⊇B, but violates B⊇A. On the other hand, a simple counting argument shows that
the cycle rules are sound in the finite case. Let |r[A]| denote the cardinality of column A of relation
r. Each antecedent A_k→A_{k+1} or A_k⊇A_{k+1} of a cycle rule forces |r[A_k]| ≥ |r[A_{k+1}]|, so if all the
antecedents hold in r we have |r[A_0]| ≥ |r[A_1]| ≥ ... ≥ |r[A_m]| ≥ |r[A_0]|, and hence
|r[A_0]| = |r[A_1]| = ... = |r[A_m]|. Now if a finite relation r satisfies |r[A]| = |r[B]| and A→B, it
easily follows that it satisfies B→A. Similarly, from |r[A]| = |r[B]| and A⊇B it follows for finite
databases that B⊇A.
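
The counting argument can be checked on small instances. The sketch below (an illustration only; the helper names and the encoding of a relation as a list of dictionaries are assumptions) tests FD's and u-ID's on a finite relation. It shows that any finite prefix of the infinite relation above already violates the antecedent A⊇B, whereas a finite relation that does satisfy both antecedents also satisfies the conclusions of the cycle rule.

```python
def satisfies_fd(rel, lhs, rhs):
    """True iff the finite relation `rel` (list of dicts) satisfies lhs -> rhs."""
    seen = {}
    for t in rel:
        key = tuple(t[a] for a in lhs)
        if seen.setdefault(key, t[rhs]) != t[rhs]:
            return False
    return True


def satisfies_uid(rel, sub, sup):
    """True iff column `sub` is contained in column `sup` (the u-ID sub <= sup)."""
    return {t[sub] for t in rel} <= {t[sup] for t in rel}


# Finite prefix of the infinite relation: t_k[A] = k, t_k[B] = k + 1.
prefix = [{"A": k, "B": k + 1} for k in range(5)]

# A finite relation satisfying both antecedents B -> A and A >= B.
cyclic = [{"A": k, "B": (k + 1) % 5} for k in range(5)]
```

For `prefix`, the FD B→A holds but the u-ID B⊆A fails (the value 5 is missing from column A), so the truncation is not a finite counterexample to the cycle rule.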

In order to analyze the rules (*), we use a graph notation for dependencies similar to the notation
of Subsection 2.1.1. If Σ is a set of FD's and u-ID's, G_Σ is a graph which has a node a_m for each
attribute A_m, a red arc (a_k,a_j) for each FD A_k→A_j in Σ, and a black arc (a_j,a_k) for each u-ID A_k⊆A_j
in Σ. If between nodes u,v of G_Σ we have red (black) arcs in both directions, we replace them with an
undirected red (black) edge. The transitivity and cycle rules imply that, when A_k→A_j (A_k⊇A_j)
corresponds to some arc in a directed cycle of G_Σ, we can infer A_j→A_k (A_j⊇A_k). In fact, if Σ is
closed under the rules (*) then G_Σ has a good deal of structure, as can be easily verified.
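
The inference just described (reverse every arc that lies on a directed cycle of G_Σ) can be sketched as follows for unary FD's and u-ID's. This is an illustrative sketch under stated assumptions, not the full closure under the rules (*): it handles only unary FD's, it performs only this one inference, and the function name and pair encoding are hypothetical.

```python
def reverse_cycle_arcs(attrs, fds, uids):
    """fds: set of (A, B) for the unary FD A -> B.
    uids: set of (A, B) for the u-ID A <= B.
    Returns (fds', uids') with every arc on a directed cycle also reversed."""
    red = set(fds)                        # red arc (a, b) for A -> B
    black = {(b, a) for (a, b) in uids}   # black arc (b, a) for A <= B
    # reach[u] = set of nodes reachable from u by red and black arcs.
    reach = {a: {a} for a in attrs}
    for u, v in red | black:
        reach[u].add(v)
    changed = True
    while changed:
        changed = False
        for u in attrs:
            new = set().union(*(reach[v] for v in reach[u]))
            if not new <= reach[u]:
                reach[u] |= new
                changed = True
    # The arc (u, v) lies on a directed cycle iff u is reachable from v.
    new_fds = set(fds) | {(v, u) for (u, v) in red if u in reach[v]}
    # Reversing the black arc (u, v), which encodes V <= U, yields U <= V.
    new_uids = set(uids) | {(u, v) for (u, v) in black if u in reach[v]}
    return new_fds, new_uids
```

On the two-attribute example with B→A and A⊇B, the red arc (b,a) and the black arc (a,b) form a cycle, so the sketch adds A→B and B⊇A.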

Proposition 4.2: If Σ is a set of FD's and u-ID's closed under the rules (*) then G_Σ has the
following properties:

1. Nodes have red (black) self-loops. The red (black) subgraph of G_Σ is transitively closed.

2. The subgraphs induced by the strongly connected components of G_Σ are undirected.

3. In each strongly connected component of G_Σ, the red (black) edges partition the set of nodes into a
collection of node-disjoint cliques.

4. If A_1...A_n→A is an FD in Σ and a_1,...,a_n have a common ancestor u in the red subgraph of G_Σ,
then G_Σ contains a red arc (u,a). ∎

By a topological sort of the dag of strongly connected components of G_Σ we can assign to each
component a unique scc-number, smaller than the scc-number of all its descendant components in the
dag [2]. Thus every node u in the graph G_Σ of Proposition 4.2 belongs to a unique maximal red
(black) clique and a unique strongly connected component. Let scc(u) denote the scc-number of the
component of node u.

Figure 4-1 illustrates an example of such a graph G_Σ. There are four strongly connected
components, each a black clique, with all black arcs present from components with smaller to
components with larger scc-number. The red cliques and red arcs are shown explicitly.

We now give a construction which lies at the heart of our completeness proof.

Lemma 4.1: Let Σ and G_Σ be as in Proposition 4.2 (i.e., closed under the rules (*)). Let the dag of
strongly connected components of G_Σ be topologically sorted, so that each component has a unique
scc-number. We can construct a finite relation r such that:

1. The u-FD A→B holds in r iff it is in Σ. Also all FD's in Σ hold in r.

2. The only repeated symbol in each column of r is 0, and the symbols in r[A] are exactly the integers
from 0 to |r[A]|-1. Moreover, |r[A]|≥|r[B]| iff scc(a)≤scc(b) (thus, the u-ID A⊇B holds in r iff
scc(a)≤scc(b), and all u-ID's in Σ hold in r).

Proof: First put in r a tuple of all 0's. Process each strongly connected component of G_Σ in turn, in
order of increasing scc-number. Begin processing a component by processing in turn each of its red
cliques. To process a red clique k, add a tuple with all 0's in the columns of the attributes of k and of
the attributes in all red cliques that are descendants of k in the red subgraph of G_Σ. For now leave all
other positions blank.

For every red clique k keep a count of the number of 0's in one of its columns (by the way the
construction proceeds all columns of k have the same number of 0's). Now that one tuple was added
for each red clique in the component, in order to terminate processing the component repeat certain
of the tuples just added, so as to make the counts of all cliques in the component equal, and strictly
greater than the counts of the cliques of the previous component. This is possible because no red
clique is a red descendant of another red clique in the same component, or in a component with
larger scc-number. Once a component is processed, no further 0's are added in its columns and its
counts no longer change.

After adding tuples for all red cliques in all strongly connected components, we examine in turn each 
column. If the column has s blank positions, we fill them in with the numbers 1 to s, without any 
repetitions. We illustrate the construction in Figure 4-1. 

Now it is easy to check that conditions 1,2 hold:
1. No u-FD in Σ was violated during the construction. Furthermore, all u-FD's not in Σ were
violated. To see this, observe that if A→B is not in Σ, then the tuple inserted for the red clique of A
and the initial tuple of all 0's disprove A→B.

We must also verify that all non-unary FD's in Σ are satisfied. Suppose A_1...A_n→A is an FD in Σ
violated by r. Since the only repeated symbol in each column is 0, there is a tuple t of r such that
t[A_j] = 0, j = 1,...,n, t[A]>0. Now t was inserted in r while processing a red clique k, so all 0's in t
correspond to attributes that are functionally determined by every attribute B of k. Since Σ is closed
under Rules 1,2,3, it follows that B→A_j is in Σ, j = 1,...,n, and also B→A is in Σ. But then r satisfies
B→A, and since t[B] = 0 and there is an initial tuple of all 0's, we obtain t[A] = 0, which is a
contradiction.

2. By the way r is constructed, the final counts are strictly increasing with the scc-numbers, and are
equal in all columns of a strongly connected component. ∎

We will now prove our main result:

Theorem 4.1: The rules (*) are sound and complete for finite implication of FD's and u-ID's.

Proof: We have already argued for soundness, so it remains to show completeness. Let Σ be a set
of FD's and u-ID's closed under the rules (*), and let σ be an FD or u-ID not in Σ. We will exhibit a
finite counterexample relation r which satisfies Σ but violates σ.

Case 1 (σ is an FD):
If σ is unary, then the relation constructed in Lemma 4.1 is the desired counterexample. If σ is not
unary, we can use a construction similar to that of Lemma 4.1. In this case the counterexample
relation is the union of two relations r_0, r_1.

Let σ be X→A. The first relation r_0 is a two-tuple relation with one tuple all x's and the other having
x's only in the attributes that are functionally determined by X in the set Σ. The remaining positions
of this second tuple are initially left blank.

The second relation r_1 contains the symbols 0,1,... (but not x) and is constructed so that the union of
r_0 and r_1 has the right number of repetitions of the symbol 0 in r_1 to satisfy all u-ID's in Σ. The
construction of r_1 parallels the Proof of Lemma 4.1. The only difference is that now the counts are the
number of 0's and x's in the union of the two relations. When the correct number of blanks have been
inserted in all columns, i.e. all columns in a strongly connected component have the same count and
count increases with scc-number, then the blanks can be filled in as in the Proof of Lemma 4.1 and all
u-ID's in Σ are satisfied.

Case 2 (σ is a u-ID):
Let σ be C⊇D. Repeat the construction in the Proof of Lemma 4.1, with the following modification:
if the column for attribute A has s blank positions, fill in the blanks with the numbers 1 to s if there is
no black arc (a,d) in G_Σ; otherwise, fill in the blanks with 1,...,s-1, x. The relation thus constructed
satisfies the FD's in Σ, by the same argument as in the Proof of Lemma 4.1. To see that the u-ID's in
Σ are also satisfied, observe that A⊇B is violated iff either
(i) scc(a)>scc(b), or

(ii) scc(a)≤scc(b), there is no black arc (a,d), and there is a black arc (b,d).

By the properties of G_Σ, this means there is no black arc (a,b), i.e. A⊇B is not in Σ. Finally, it is clear
that C⊇D is violated.
See Figure 4-2 for an example of this construction. ∎

We remark that Theorem 4.1 leads easily to a polynomial-time algorithm for finite implication of
FD's and u-ID's [44]. We will now use Theorem 4.1 to prove an analogue of Theorem 2.1, this time
for finite implication of FD's and u-ID's. The notation is taken from Chapter 2.

Theorem 4.2: In each of the following two cases, (i),(ii),(iii) are equivalent:
FD Case:

i) Σ ⊨_fin A_1...A_n→A.

ii) E_Σ ⊨_fin ⋁_{τ∈T+(M_f)} τ[x_1/a_1x,...,x_n/a_nx] = ax.

iii) ℰ_Σ ⊨_fin ⋁_{τ∈T+(M_f)} τ[x_1/α_1,...,x_n/α_n] = α.
u-ID Case:

i) Σ ⊨_fin B⊆A.

ii) E_Σ ⊨_fin ⋁_{τ∈T+(M_i)} aτ = bx.

iii) ℰ_Σ ⊨_fin ⋁_{τ∈T+(M_i)} τ[x/α] = β.

Proof: The implications (iii)=>(ii), (ii)=>(i) can be proved by the same argument as in the Proof
of Theorem 2.1. The reason is that the constructions we give map finite counterexamples to finite
counterexamples.

(i)=>(iii): Suppose Σ ⊨_fin σ, where σ is an FD or u-ID. By Theorem 4.1, there is a proof of σ from
Σ using the rules (*). Let z be the number of steps of such a proof. We show both the FD and the
u-ID Cases by simultaneous induction on z.

Basis: z = 0. The conclusion is straightforward.

Induction Step: We distinguish six cases, depending on the last rule which was applied to prove σ.

Rules 1,2 Straightforward.

Rule 3 This means the FD's A_1...A_n→B_k, k = 1,...,m, B_1...B_m→A can be proved from Σ (in less
than z steps); Rule 3 is then applied to derive A_1...A_n→A. By the induction hypothesis, ℰ_Σ finitely
implies ⋁_{τ_k∈T+(M_f)} τ_k[x_1/a_1x,...,x_n/a_nx] = b_kx, k = 1,...,m, and also ℰ_Σ finitely implies
⋁_{τ∈T+(M_f)} τ[x_1/b_1x,...,x_m/b_mx] = ax. Thus, ℰ_Σ finitely implies
⋁_{τ,τ_1,...,τ_m∈T+(M_f)} τ[x_1/τ_1[x_1/a_1x,...,x_n/a_nx], ..., x_m/τ_m[x_1/a_1x,...,x_n/a_nx]] = ax, i.e.
ℰ_Σ ⊨_fin ⋁_{τ∈T+(M_f)} τ[x_1/a_1x,...,x_n/a_nx] = ax.

Rule 4 Straightforward.

Rule 5 Similar to Rule 3.

Rule 6 Now the dependencies A_0→A_1, A_1⊇A_2, ..., A_{m-1}→A_m, A_m⊇A_0 (m odd) can be proved
from Σ (in less than z steps); then by a cycle rule we derive A_1→A_0.

Let 𝒜 be a finite model of ℰ_Σ. By the induction hypothesis 𝒜 satisfies ρ_0α_0 = α_1, τ_1α_1 = α_2, ...,
ρ_{m-1}α_{m-1} = α_m, τ_mα_m = α_0, where ρ_k∈T+(M_f), τ_k∈T+(M_i) (we write τα as a shorthand for τ[x/α]).
We will show that there is some ρ' in T+(M_f) such that 𝒜 satisfies ρ'α_1 = α_0.

Observe first that 𝒜 satisfies ρ_0τ_mρ_{m-1}...τ_3ρ_2τ_1α_1 = α_1 (concatenation denotes composition). By
the commutativity conditions (5) of ℰ_Σ, ρ_0τ_mρ_{m-1}...τ_3ρ_2τ_1 = ρ_0ρ_{m-1}...ρ_2τ_m...τ_3τ_1, so 𝒜 satisfies
ρ_0ρ_{m-1}...ρ_2τ_m...τ_3τ_1α_1 = α_1. Now put ρ_0ρ_{m-1}...ρ_2 = ρ, τ_m...τ_3τ_1 = τ, τ_m...τ_3τ_1α_1 = α.
We now have τα_1 = α, ρα = α_1. We will argue from these two equations that there exists some ρ' in
T+(M_f) such that 𝒜 satisfies ρ'α_1 = α. It will then follow, since ρ_{m-1}...ρ_2α = α_0, that 𝒜 satisfies
ρ_{m-1}...ρ_2ρ'α_1 = α_0.

Consider the set K = {ρ^kα_1 : k≥0} (ρ^k is ρ composed with itself k times). Since 𝒜 is finite, K is
finite, and therefore there exists a least integer q such that ρ^qα_1 = ρ^sα_1, for some s greater than q. We
will first argue that q = 0. Assume on the contrary that q≥1. By commutativity,
τρ^qα_1 = ρ^qτα_1 = ρ^qα = ρ^{q-1}ρα = ρ^{q-1}α_1, and similarly τρ^sα_1 = ρ^{s-1}α_1. But this means
ρ^{q-1}α_1 = ρ^{s-1}α_1, which contradicts the choice of q.

Since q = 0, 𝒜 satisfies α_1 = ρ^sα_1, where s>0. But now α = τα_1 = τρ^sα_1 = ρ^sτα_1 = ρ^sα = ρ^{s-1}α_1, i.e.
𝒜 satisfies ρ^{s-1}α_1 = α. This concludes the proof.



If a cycle rule is applied to derive a u-ID, we argue in an analogous way.

Chapter Five
Partition Dependencies 



5.1 Preliminaries 

Let D be a database scheme containing a single relation scheme R[U], U = {A_1,...,A_u}. We can
express database constraints as formulas of first-order predicate calculus with equality [32]. These
formulas have a single relation symbol R of arity u which represents the relation R, and no function
(or constant) symbols.

Specifically, let us call atomic formulas of the form Rx_1...x_u relational formulas and atomic
formulas x = y equalities. A formula is typed iff there are disjoint classes (types) of variables such that

1. if Rx_1...x_u appears in the formula, then x_k is of type k, k = 1,...,u, and

2. if x = y appears in the formula, then x,y have the same type.

Definition 5.1: An embedded implicational dependency (EID [34]) is a typed sentence of the form
∀x_1...x_p. [(φ_1∧...∧φ_n) ⇒ ∃y_1...y_q. (ψ_1∧...∧ψ_m)],
where each φ_k is a relational formula, each ψ_k is either a relational formula or an equality between
two of the x_k's, and each of the x_k's appears in one of the φ_k's.

Example 5.1:

(a) Let U = {A_1,A_2,A,B}. The FD A_1A_2→A can be expressed as the EID
∀x_1x_2xyx'y'. [(Rx_1x_2xy ∧ Rx_1x_2x'y') ⇒ x = x'].

(b) Let U = {A,B,C}. The MVD A→→B [62, 51] is equivalent to the EID
∀zxyx'y'. [(Rzxy ∧ Rzx'y') ⇒ Rzxy'].

Now let r be a relation over a finite universe of attributes U, and let σ be an EID. As one can
easily observe, to decide whether r ⊨ σ we do not need to know the particular values appearing in r,
but only the equalities between these values. As a matter of fact, all that is relevant about two tuples
t,s of r is the set of attributes on which they agree. We can capture this information formally by
considering, for each attribute A in U, the partition π_A which is induced on the set of tuples of r by
the values of r in column A: two tuples t,s of r are in the same block of π_A iff they agree on A. The set
{π_A | A∈U} characterizes the EID's satisfied by r.

Although the above observation does not seem to take us very far regarding general EID's, it does
lead to an elegant algebraic formulation of FD's [15, 60, 27]. Recall that partitions have a natural
partial order ≤, and two natural binary operations •, + : Given partitions π, π' of a set S,

π ≤ π' iff for every block x of π there is a block x' of π' such that x⊆x'.

π•π' = {x | x = y∩z ≠ ∅, y∈π, z∈π'}.

π+π' = {x | a,b∈S are in x iff there is a sequence x_0,...,x_n such that
x_i∈π∪π' for i = 0,...,n, a∈x_0, b∈x_n, and x_i∩x_{i+1} ≠ ∅ for i = 0,...,n-1}.

Notice that π•π' is the coarsest common refinement of π,π' (in the sense of ≤) and π+π' is their
finest common generalization. Also •,+ are associative, commutative and idempotent (cf. Section 5.3).
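
The order and the two operations on partitions can be sketched in Python as follows. This is an illustration only: the representation of a partition as a set of frozensets and the names `leq`, `product` and `psum` are assumptions, not notation from the text.

```python
def part(*blocks):
    """Build a partition (a set of frozensets) from its blocks."""
    return {frozenset(b) for b in blocks}


def leq(p, q):
    """p <= q: every block of p is contained in some block of q."""
    return all(any(x <= y for y in q) for x in p)


def product(p, q):
    """p . q: the non-empty pairwise intersections of blocks."""
    return {x & y for x in p for y in q if x & y}


def psum(p, q):
    """p + q: merge blocks of p and q that are linked by a chain of overlaps."""
    merged = []  # pairwise disjoint blocks built so far
    for block in p | q:
        current = set(block)
        rest = []
        for m in merged:
            if m & current:
                current |= m  # absorb every block that overlaps
            else:
                rest.append(m)
        merged = rest + [current]
    return {frozenset(m) for m in merged}
```

For example, with p = {{1,2},{3},{4}} and q = {{1},{2,3},{4}}, the product is the partition into singletons and the sum is {{1,2,3},{4}}.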

With the above remarks, it is easy to see that an FD such as AB→CD holds in relation r iff

π_A•π_B ≤ π_C•π_D

or, equivalently,

π_A•π_B = π_A•π_B•π_C•π_D

or, still,

π_C•π_D = π_C•π_D + π_A•π_B

Thus, FD's can be expressed equationally using product and sum of partitions. It is then natural to
investigate the expressive power of general equations one can write using •, + .

Definition 5.2:

a. The set of partition expressions over U, W(U), is the least set satisfying the following closure
conditions:

1. A∈W(U), for A in U.

2. If e,e'∈W(U), then (e•e'), (e+e') are in W(U).

(•,+ are meant here as uninterpreted operator symbols)

b. A partition dependency (PD) is an equation e = e', where e,e'∈W(U).

The above definition gives the syntax of PD's. The semantics of PD's are given below:

Definition 5.3:

a. Let r be a relation over U, S the set of tuples of r. For A in U,

π_A = {x | t,s∈S are in x iff t[A] = s[A]}.
Then L(r) is the set obtained by closing {π_A | A∈U} under product and sum of partitions.

b. Let e∈W(U). The meaning of e in L(r), μ_r(e), is defined inductively as follows:

1. μ_r(A) = π_A, A in U.

2. μ_r(e•e') = μ_r(e)•μ_r(e'),
μ_r(e+e') = μ_r(e)+μ_r(e').

Relation r satisfies a PD e = e' (notation: r ⊨ e = e') iff μ_r(e) = μ_r(e').

Observe that L(r) is actually a lattice [28], generated by the set {π_A | A∈U}. As a matter of fact,
r ⊨ e = e' iff L(r) satisfies the equation e = e' (with A interpreted as π_A, A∈U).

From Definition 5.3, we see that we can use the formalism of PD's to express an FD AB→CD as
the PD A•B = A•B•C•D. Clearly r ⊨ AB→CD iff r ⊨ A•B = A•B•C•D (here and in the sequel we
omit parentheses from PD's wherever possible, for the sake of clarity). Partition dependencies of the
above form, which are equivalent to FD's, are of special interest; we call them FPD's.
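
Definition 5.3 can be turned into a small evaluator. The following Python sketch computes μ_r(e) for a finite relation and tests whether r satisfies a PD; it is only an illustration. The encoding of expressions as attribute names or nested tuples ("*", e, e') and ("+", e, e'), and all function names, are assumptions.

```python
def column_partition(rel, attr):
    """pi_A: group the tuple indices of `rel` by their value in column `attr`."""
    groups = {}
    for i, t in enumerate(rel):
        groups.setdefault(t[attr], set()).add(i)
    return {frozenset(g) for g in groups.values()}


def product(p, q):
    return {x & y for x in p for y in q if x & y}


def psum(p, q):
    merged = []
    for block in p | q:
        current, rest = set(block), []
        for m in merged:
            if m & current:
                current |= m
            else:
                rest.append(m)
        merged = rest + [current]
    return {frozenset(m) for m in merged}


def meaning(rel, e):
    """mu_r(e), following the inductive definition."""
    if isinstance(e, str):
        return column_partition(rel, e)
    op, left, right = e
    combine = product if op == "*" else psum
    return combine(meaning(rel, left), meaning(rel, right))


def satisfies_pd(rel, lhs, rhs):
    return meaning(rel, lhs) == meaning(rel, rhs)


# The FPD A.B = A.B.C.D for the FD AB -> CD.
AB = ("*", "A", "B")
ABCD = ("*", ("*", AB, "C"), "D")
good = [dict(A=1, B=1, C=3, D=4, E=0), dict(A=1, B=1, C=3, D=4, E=1),
        dict(A=1, B=2, C=3, D=5, E=2)]
bad = [dict(A=1, B=1, C=3, D=4, E=0), dict(A=1, B=1, C=3, D=9, E=1)]
```

Here `good` satisfies AB→CD and hence the FPD, while `bad` has two tuples that agree on AB but differ on D.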

In the remainder of this Chapter, we investigate various questions concerning PD's. Section 5.2
deals with the expressive power of PD's, and compares PD's to EID's from this point of view. In
Section 5.3 we give a polynomial-time algorithm for the implication problem for PD's. Finally, in
Section 5.4 we present a polynomial-time test for consistency of a database with a set of PD's.



5.2 Expressive Power 

We want to study what properties of a relation r can be expressed using sets of PD's. From the
definitions of •,+ and Definition 5.3 it is easy to see the following:

1. r ⊨ C = A•B iff for any tuples t,s∈r,
t[C] = s[C] iff t[A] = s[A] and t[B] = s[B].

2. r ⊨ C = A+B iff for any tuples t,s∈r,
t[C] = s[C] iff there is a sequence s_0,...,s_n of tuples of r with t = s_0, s_n = s, and for
i = 0,...,n-1, s_i[A] = s_{i+1}[A] or s_i[B] = s_{i+1}[B].

From observation (2) above, one sees that symmetric transitive closure can be expressed by a PD,
as follows:

Example 5.2: Consider a relation r representing an undirected graph. This relation has three
attributes: HEAD, TAIL and COMPONENT. For every edge {a,b} in the graph we have in the relation
tuples abc, bac, aac, bbc, where c is a number which could vary with {a,b}. These are the only tuples
in r. We would like to express that: for each tuple t of r, t[COMPONENT] is the connected component in
which the arc (t[HEAD], t[TAIL]) belongs. We can do this by insisting that r satisfies the PD

COMPONENT = HEAD + TAIL.
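
Example 5.2 can be tried out on a small graph. The sketch below builds the relation of the example from an edge list and a (hypothetical) labelling of edges by component numbers, then tests the PD COMPONENT = HEAD + TAIL. The function names, the tuple layout (head, tail, component) and the union-find implementation of the partition sum are assumptions made for illustration.

```python
def graph_relation(edges, component):
    """Tuples (u,v,c), (v,u,c), (u,u,c), (v,v,c) for every edge {u,v} labelled c."""
    rows = set()
    for u, v in edges:
        c = component[(u, v)]
        rows |= {(u, v, c), (v, u, c), (u, u, c), (v, v, c)}
    return sorted(rows)


def blocks_by(rel, idx):
    """Partition of tuple indices induced by column `idx`."""
    groups = {}
    for i, t in enumerate(rel):
        groups.setdefault(t[idx], set()).add(i)
    return {frozenset(g) for g in groups.values()}


def partition_sum(p, q):
    """Sum of two partitions of the same set, computed with union-find."""
    parent = {}

    def find(x):
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for block in p | q:
        first, *rest = block
        find(first)
        for y in rest:
            parent[find(y)] = find(first)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return {frozenset(g) for g in groups.values()}


def component_pd_holds(rel):
    """Does rel satisfy COMPONENT = HEAD + TAIL (columns 2, 0, 1)?"""
    return blocks_by(rel, 2) == partition_sum(blocks_by(rel, 0), blocks_by(rel, 1))


edges = [("a", "b"), ("b", "c"), ("d", "e")]
right = graph_relation(edges, {("a", "b"): 1, ("b", "c"): 1, ("d", "e"): 2})
wrong = graph_relation(edges, {("a", "b"): 1, ("b", "c"): 3, ("d", "e"): 2})
```

The relation `right` labels the path a-b-c with one component number and satisfies the PD; `wrong` splits that path into two labels and violates it.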

We now want to compare the expressive power of PD's to that of previously studied database
constraints, namely EID's [34]. Let us say that an EID σ is expressed by a set E of PD's iff for any
relation r, r ⊨ σ iff r ⊨ E. From the algebraic properties of •, the PD C = A•B is equivalent to
C = C•A•B ∧ A•B = C•A•B, and therefore it is expressed by the set {C→AB, AB→C}. However,
because of Example 5.2 above it should come as no surprise [4] that the PD C = A+B cannot be
expressed by any set of EID's:

Theorem 5.1: Let U = ABC; the PD C = A+B cannot be expressed by any set of first-order
sentences.

Proof: Let Σ be a set of first-order sentences (with a single ternary relation symbol R as the only
non-logical symbol) which expresses C = A+B. For k≥1, let φ_k be the following first-order formula,
with free variables t,s:

"t[C] = s[C] and there is no sequence s_0,...,s_k such that t = s_0, s_k = s, and for i = 0,...,k-1,
s_i[A] = s_{i+1}[A] or s_i[B] = s_{i+1}[B]"

(it is easy to see how to write φ_k without tuple variables). Observe that the relation r in Figure 5-1
(with t,s as indicated) is a model for Σ∪{φ_k}: r ⊨ C = A+B so r ⊨ Σ, and clearly r ⊨ φ_k. Thus, any
finite subset of Σ' = Σ∪{φ_k : k≥1} has a model, and thus by the Compactness Theorem [32] Σ' has a
model, say r'. But this is a contradiction, since r' satisfies Σ and thus r' satisfies C = A+B, and on the
other hand r' ⊨ φ_k for all k≥1 and therefore it does not satisfy C = A+B. ∎






On the other hand, an EID as simple as an MVD cannot be expressed by PD's:

Theorem 5.2: Let U = ABC; the MVD A→→B cannot be expressed by any set of PD's.

Proof: Let E be a set of PD's which expresses A→→B (see Example 5.1 for the meaning of this
MVD). Referring to Figure 5-2, relation r_1 satisfies A→→B, so L(r_1) ⊨ E. On the other hand,
relation r_2 does not satisfy A→→B, so L(r_2) does not satisfy E. But this is a contradiction, because
L(r_1), L(r_2) are isomorphic, and thus they satisfy exactly the same PD's. ∎



5.3 The Implication Problem 

Given a finite set E of PD's and a PD δ, we want to know if E ⊨ δ, i.e. if δ holds in every relation
that satisfies E. We also want to know if E ⊨_fin δ, i.e. if δ holds in every finite relation that satisfies
E. We first observe that these questions can be approached as implication problems for lattices.

Lemma 5.1:

a. E ⊨ δ iff E ⊨_lat δ, i.e. iff δ holds in every lattice that satisfies E.

b. E ⊨_fin δ iff E ⊨_lat,fin δ, i.e. iff δ holds in every finite lattice that satisfies E.

Proof:

a. (<=): Suppose E ⊨_lat δ, and let r be a relation that satisfies E. Then L(r) ⊨ E, so δ holds in L(r), and
thus r satisfies δ.

(=>): Suppose E ⊨ δ, and let L be a lattice satisfying E. By the Representation Theorem for
lattices [28, 66], we may take the elements of L to be partitions of some set X. Thus, each A in U is
interpreted in L as a partition π_A of X (and, of course, •,+ in L are partition product and sum
respectively). Now consider a relation r over U containing a tuple t_i for each element i of X (these are
the only tuples in r), where t_i[A] = t_j[A] iff i,j are in the same block of π_A, A in U. Clearly r satisfies
exactly the same PD's as L. Thus r ⊨ E, so by the hypothesis r ⊨ δ, and therefore L ⊨ δ.

b. (<=): Observe, in the proof of the "if" direction of (a), that if r is finite then L(r) is also finite.

(=>): Observe, in the proof of the "only if" direction of (a), that if L is finite then the set X can be
taken to be finite, by the Representation Theorem for finite lattices [56]. Then the relation r is also
finite. ∎






Now E ⊨_lat δ can be viewed as a (uniform) word problem, since a set with two binary operations
•,+ is a lattice iff the following set of axioms (LA) is satisfied [28]:

1. x+x = x, x•x = x (idempotency)

2. x+y = y+x, x•y = y•x (commutativity)

3. x+(y+z) = (x+y)+z, x•(y•z) = (x•y)•z (associativity)

4. x+(x•y) = x, x•(x+y) = x (absorption)

I.e., E ⊨_lat δ iff δ is implied from E ∪ LA. We are going to show that ⊨_lat,fin is equivalent to ⊨_lat,
so ⊨_lat,fin can also be viewed as a word problem.

In particular, let δ_σ be the FPD corresponding to an FD σ (δ_σ is A = A•B if σ is A→B), and let
E_Σ be the set of FPD's corresponding to a set of FD's Σ. Since r ⊨ σ iff r ⊨ δ_σ, Σ ⊨ σ iff E_Σ ⊨ δ_σ.
Thus, the implication problem for FD's can be reduced, in a straightforward way, to the (uniform)
word problem for idempotent commutative semigroups (structures with a single associative,
commutative and idempotent operator). On the other hand, since X = Y is equivalent to X = X•Y ∧
Y = Y•X, we can also reduce the above word problem to the implication problem for FD's.

We now present a polynomial-time algorithm for the (finite) implication problem for PD's.
Suppose we are given a set E of PD's, and a PD e = e': by Lemma 5.1, it suffices to test if E ⊨_lat e = e'
(E ⊨_lat,fin e = e').

Consider the set W(U) of partition expressions over U, •, + : we define several binary relations on
W(U). First, define ≤_id (identically less-than-or-equal) inductively as follows:

1. A ≤_id A, A in U.

2. if p ≤_id r, q ≤_id r then p+q ≤_id r.

3. if p ≤_id r or q ≤_id r then p•q ≤_id r.

4. if r ≤_id p, r ≤_id q then r ≤_id p•q.

5. if r ≤_id p or r ≤_id q then r ≤_id p+q.

(The intended meaning of ≤_id is that p ≤_id q iff every lattice satisfies p≤q, no matter how the A's
in U are interpreted).

The relation ≤_id is reflexive and transitive [28, 65]. Also, if p_1 ≤_id q_1, p_2 ≤_id q_2, then
p_1+p_2 ≤_id q_1+q_2 and p_1•p_2 ≤_id q_1•q_2.

Now define =_id as follows: p =_id q iff both p ≤_id q and q ≤_id p.

The relation =_id is an equivalence relation, and in particular it is a congruence: i.e., if p_1 =_id q_1,
p_2 =_id q_2, then p_1+p_2 =_id q_1+q_2 and p_1•p_2 =_id q_1•q_2. Thus, one can define •,+ on the set of
equivalence classes of =_id. The structure obtained this way is a lattice [28, 65].
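
The inductive definition of ≤_id can be read as a recursive test. The Python sketch below is one possible reading, not an algorithm stated in the text: it relies on the reflexivity and transitivity of ≤_id cited above to decompose a sum on the left and a product on the right first, and then tries Rules 3 and 5. The tuple encoding of expressions and the function names are assumptions.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def leq_id(p, q):
    """p <=_id q for expressions encoded as attribute names (str) or
    tuples ("*", e1, e2) / ("+", e1, e2)."""
    p_atom, q_atom = isinstance(p, str), isinstance(q, str)
    if p_atom and q_atom:
        return p == q                                    # Rule 1
    if not p_atom and p[0] == "+":
        return leq_id(p[1], q) and leq_id(p[2], q)       # Rule 2
    if not q_atom and q[0] == "*":
        return leq_id(p, q[1]) and leq_id(p, q[2])       # Rule 4
    # Now p is an attribute or a product, and q is an attribute or a sum.
    if not p_atom and (leq_id(p[1], q) or leq_id(p[2], q)):
        return True                                      # Rule 3
    if not q_atom and (leq_id(p, q[1]) or leq_id(p, q[2])):
        return True                                      # Rule 5
    return False


def eq_id(p, q):
    """p =_id q iff p <=_id q and q <=_id p."""
    return leq_id(p, q) and leq_id(q, p)
```

For instance, the absorption law A+(A•B) = A is recognized by `eq_id`, while the distributive inequality A•(B+C) ≤ (A•B)+(A•C) is rejected, since it fails in some lattices.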

We now capture the effect of E. Define the following relation →→_E on W(U): p →→_E q iff q can
be obtained from p as follows: for i = 0,...,n, substitute w_i for some (zero or more) occurrences of z_i,
where z_i = w_i (w_i = z_i) is in E. It is easily verified that →→_E is a congruence.

Now define ≤_E as the sum of ≤_id, →→_E: p ≤_E q iff there is a sequence of expressions s_0,...,s_n
such that p = s_0, s_n = q, and for i = 0,...,n-1, s_i ≤_id s_{i+1} or s_i →→_E s_{i+1}.

It is easy to see that ≤_E is reflexive and transitive. Also if p_1 ≤_E q_1, p_2 ≤_E q_2, then
p_1+p_2 ≤_E q_1+q_2 and p_1•p_2 ≤_E q_1•q_2 (because both ≤_id and →→_E have this property [36]).

Finally, define =_E as follows: p =_E q iff both p ≤_E q and q ≤_E p.

The relation =_E is an equivalence relation, and moreover it is a congruence. One can further
observe that the equivalence classes of =_E form a lattice L_E under the induced •, + : just check the
axioms LA, e.g. p+p =_E p because p+p =_id p, and in general if p =_id q then p =_E q. Note that L_E
satisfies a PD p = q iff p =_E q (A∈U is interpreted in L_E as the equivalence class of A).

We now show that the relation =_E captures the PD's (finitely) implied by E:

Lemma 5.2: The following statements are equivalent:
a. e =_E e'
b. E ⊨_lat e = e'
c. E ⊨_lat,fin e = e'

Proof: Observe that, from the way ≤_id and ≤_E were defined, if e ≤_E e' then e≤e' in every lattice
satisfying E (where ≤ is the partial order of the lattice). Thus, (a)=>(b). To prove (b)=>(a), recall
that L_E satisfies a PD p = q iff p =_E q. Thus, if e ≠_E e' then L_E does not satisfy e = e', whereas it satisfies
E; i.e., L_E is a counterexample to E ⊨_lat e = e'.

We now show the equivalence of (b),(c). The direction (b)=>(c) is obvious. To prove the converse,
we adapt an argument of [30] (see also [28]), originally given for the special case E = ∅.

Suppose E does not imply e = e'; we will show that there is a finite lattice which satisfies E but
violates e = e'. Let {A_i | i = 1,...,n} be the set of attributes appearing in E,e,e', and let V be the set of all
partition expressions (over the A_i's) of complexity at most as high as the maximum complexity of e,e'
and the expressions in E (complexity can be measured by the number of instances of •, + ). Note that
V is finite, since E is finite.

Consider now the subset L of L_E consisting of all finite products of the equivalence classes (under
=_E) of elements of V, together with the equivalence class of A_1+...+A_n. It is not hard to verify that
L is a sublattice of L_E. But by the equivalence of (a),(b) e ≠_E e', so L satisfies E and violates e = e'.
Since L is also obviously finite, we are done. ∎

We can now prove our main result:

Theorem 5.3: There is a polynomial-time algorithm for the (finite) implication problem for PD's.

Proof: By Lemmas 5.1, 5.2, it is sufficient to describe a polynomial-time algorithm to test, given
E,e,e', whether e ≤_E e'.

Let V be the set of all subexpressions of e,e', and of the expressions appearing in E. The following
algorithm constructs a set Γ of directed arcs over V such that, whenever (p,q)∈Γ, p ≤_id q or p →→_E q.

begin
    repeat until no new arcs are added
        1. add (A,A), A∈U
        2. if (p,r)∈Γ, (q,r)∈Γ, p+q∈V
               then add (p+q,r)
        3. if (p,r)∈Γ or (q,r)∈Γ, p•q∈V
               then add (p•q,r)
        4. if (r,p)∈Γ, (r,q)∈Γ, p•q∈V
               then add (r,p•q)
        5. if (r,p)∈Γ or (r,q)∈Γ, p+q∈V
               then add (r,p+q)
        6. add (z,w),(w,z), where z = w in E
        7. if (p,r)∈Γ, (r,q)∈Γ
               then add (p,q)
    end
end
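
The arc-construction algorithm can be transcribed into Python as follows. This is a straightforward, unoptimized sketch under assumptions: expressions are encoded as attribute names or nested tuples ("*", e, e') and ("+", e, e'), E is a list of pairs (z, w) standing for the PD's z = w, and the function names are hypothetical. It tests a PD e = e' by checking for arcs in both directions.

```python
def subexpressions(e, acc):
    """Add e and all of its subexpressions to the set `acc`."""
    acc.add(e)
    if not isinstance(e, str):
        subexpressions(e[1], acc)
        subexpressions(e[2], acc)
    return acc


def implies_pd(E, e1, e2):
    """Test whether the PD's in E imply e1 = e2, via the arc set over V."""
    V = set()
    for left, right in list(E) + [(e1, e2)]:
        subexpressions(left, V)
        subexpressions(right, V)
    arcs = {(a, a) for a in V if isinstance(a, str)}       # Step 1
    for z, w in E:                                         # Step 6
        arcs |= {(z, w), (w, z)}
    compound = [e for e in V if not isinstance(e, str)]
    changed = True
    while changed:                                         # repeat until stable
        changed = False
        new = set()
        for e in compound:
            op, p, q = e
            for r in V:
                below = ((p, r) in arcs, (q, r) in arcs)
                above = ((r, p) in arcs, (r, q) in arcs)
                if op == "+":
                    if all(below):
                        new.add((e, r))                    # Step 2
                    if any(above):
                        new.add((r, e))                    # Step 5
                else:
                    if any(below):
                        new.add((e, r))                    # Step 3
                    if all(above):
                        new.add((r, e))                    # Step 4
        for p, r in arcs:                                  # Step 7
            for q in V:
                if (r, q) in arcs:
                    new.add((p, q))
        if not new <= arcs:
            arcs |= new
            changed = True
    return (e1, e2) in arcs and (e2, e1) in arcs
```

As a usage example, with E = {A = A•B, B = B•C} (the FPD's of A→B and B→C) the sketch derives A = A•C, the FPD of A→C, but not B = B•A.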

Observe that Steps 1-5 in the above algorithm mirror the definition of ≤_id.

We will now prove the following

Claim: For p,q∈V, p ≤_E q iff (p,q)∈Γ.

Clearly, the Theorem follows from the Claim: to test if e ≤_E e', construct the digraph (V,Γ) and
check if it has an arc from e to e'. This can be done in polynomial time.

Proof of Claim:
(<=): Straightforward.

(=>): We first give a set of rewrite rules [41] for ≤_E:

1. x+x →→ x

2. x•y →→ x

3. y•x →→ x

4. x →→ x•x

5. x →→ x+y

6. x →→ y+x

7. z →→ w, where z = w (w = z) is in E

Observe, regarding Rules 5,6, that y can be an arbitrary expression.

An easy induction shows that, if p ≤_id q, then p can be rewritten as q using Rules 1-6. By the
definition of ≤_E, if p ≤_E q then there is a sequence of expressions s_0,...,s_n such that p = s_0, s_n = q, and
for i = 0,...,n-1, s_i →→ s_{i+1}, i.e. s_{i+1} is obtained from s_i by rewriting a subexpression of s_i according to
one of the Rules 1-7. We call such a sequence a proof that p ≤_E q.

Now we define a relation $\prec$ on pairs of expressions:

$(p_1,q_1) \prec (p_2,q_2)$ iff $p_1 \le_E q_1$, $p_2 \le_E q_2$, and either
(i) the shortest proof that $p_1 \le_E q_1$ is shorter than the shortest proof that $p_2 \le_E q_2$, or
(ii) the shortest proofs that $p_1 \le_E q_1$, $p_2 \le_E q_2$ have the same length, and $p_1$ is a proper subexpression
of $p_2$, $q_1$ is a proper subexpression of $q_2$.

Clearly $\prec$ is well-founded. We proceed by induction on $\prec$.

Basis: There is a proof that $p \le_E q$ of length 0. Then $p$ is identical to $q$, and $(p,q) \in \Gamma$.

Induction Step: Let $p,q \in V$, and assume that the Claim holds for $p',q' \in V$ whenever $(p',q') \prec (p,q)$.
We will show that the Claim holds for $(p,q)$. Let $s_0,\ldots,s_n$, $n > 0$, be a shortest proof that $p \le_E q$.

Case 1: For every $i = 0,\ldots,n-1$, $s_{i+1}$ is obtained from $s_i$ by rewriting a proper subexpression of $s_i$
according to Rules 1-7. Then $p = p_1 \odot p_2$, $q = q_1 \odot q_2$ ($\odot \in \{\cdot, +\}$), where $p_j \le_E q_j$ via proofs at most as
long as the proof that $p \le_E q$, and $p_j$ ($q_j$) is a proper subexpression of $p$ ($q$). Thus $(p_j,q_j) \prec (p,q)$, and
furthermore $p_j,q_j \in V$, so by the induction hypothesis $(p_j,q_j) \in \Gamma$. It then easily follows that $(p,q) \in \Gamma$.

Case 2: For some $i$, $0 \le i \le n-1$, $s_i$ itself is rewritten into $s_{i+1}$ according to one of the Rules 1-7.

Case 2a: For some $i$ as above, the Rule used is Rule 7. This means $p$ is rewritten to $z$, $z = w$ ($w = z$)
is in $E$, and $w$ is rewritten to $q$. Then clearly $(p,z) \prec (p,q)$, and since $z \in V$, by the induction hypothesis
$(p,z) \in \Gamma$. Similarly $(w,q) \in \Gamma$. It follows that $(p,q) \in \Gamma$.

Case 2b: For any $i$ as above, the Rule used is one of the Rules 1-6. We consider the least such $i$,
and we distinguish cases according to which Rule was used to rewrite $s_i$ to $s_{i+1}$.

Rule 1 This means $p = p_1 + p_2$, $p_1$ rewrites to $r$, $p_2$ rewrites to $r$, and $r$ rewrites to $q$. Then $p_j \le_E q$
via proofs shorter than the proof that $p \le_E q$, so $(p_j,q) \prec (p,q)$. Also $p_j \in V$, so by the induction
hypothesis $(p_j,q) \in \Gamma$. It follows that $(p,q) \in \Gamma$.

Rule 2 This means $p = p_1 \cdot p_2$, $p_1$ rewrites to $r$, $r$ rewrites to $q$. Then $p_1 \le_E q$ via a proof shorter than
the proof that $p \le_E q$, so $(p_1,q) \prec (p,q)$. Also $p_1 \in V$, so by the induction hypothesis $(p_1,q) \in \Gamma$. It follows
that $(p,q) \in \Gamma$.

Rule 3 Similar to Rule 2. 

Rule 4 Now $p$ rewrites to $r$, and Rule 4 rewrites $r$ to $r \cdot r$. Observe that the expression $r \cdot r$ will not be
rewritten subsequently using Rules 2,3, because in that case we could shorten the proof that $p \le_E q$
(however, either subexpression of $r \cdot r$ may be rewritten). Moreover, if at some later point Rule 5 is
applied to rewrite the whole expression $s_j$ as $s_j + y$, then $s_j + y$ will not be rewritten subsequently using
Rule 1. Thus, the expression $q$ eventually obtained is built up, using Rules 4,5,6, by some expressions
$r_j$, $j = 1,\ldots,m$, such that $r$ rewrites to $r_j$ for all $j$, and by some completely new expressions $y_k$, $k = 1,\ldots,m'$,
which were introduced by Rules 5,6. Now clearly $(p,r_j) \prec (p,q)$ and $r_j \in V$, so by the induction
hypothesis $(p,r_j) \in \Gamma$. It then follows by an easy induction on the structure of $q$ that $(p,q) \in \Gamma$.

Rules 5,6 Similar to Rule 4.

This concludes the Proof of the Claim, so we are done. ∎

Since inference of FD's can be seen as a special case of inference of PD's, the problem is actually
polynomial-time complete [63]. However, in the special case where $E$ is empty [28, 65] it can be solved
in logarithmic space [40], as we now outline. By Lemma 3, it suffices to describe how to recognize $\le_{id}$
in logarithmic space.

First, observe the following: 

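1. $A \le_{id} A'$ iff $A$ is identical to $A'$, $A, A'$ in $\mathcal{U}$.

2. $A \le_{id} p' \cdot q'$ iff $A \le_{id} p'$ and $A \le_{id} q'$, $A$ in $\mathcal{U}$.

3. $A \le_{id} p' + q'$ iff $A \le_{id} p'$ or $A \le_{id} q'$, $A$ in $\mathcal{U}$.

4. $p \cdot q \le_{id} A'$ iff $p \le_{id} A'$ or $q \le_{id} A'$, $A'$ in $\mathcal{U}$.

5. $p \cdot q \le_{id} p' \cdot q'$ iff $p \cdot q \le_{id} p'$ and $p \cdot q \le_{id} q'$.

6. $p \cdot q \le_{id} p' + q'$ iff $p \le_{id} p' + q'$ or $q \le_{id} p' + q'$ or $p \cdot q \le_{id} p'$ or $p \cdot q \le_{id} q'$.

7. $p + q \le_{id} e'$ iff $p \le_{id} e'$ and $q \le_{id} e'$.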

In each of the above cases, the "if" direction is trivial. The "only-if" direction follows in Case 5
because $p' \cdot q' \le_{id} p'$ and $p' \cdot q' \le_{id} q'$, and in Case 7 because $p \le_{id} p + q$, $q \le_{id} p + q$. In the
remaining cases, the "only-if" direction follows by the definition of $\le_{id}$.

The above observation gives a recursive algorithm to test, given $e,e'$, whether $e \le_{id} e'$. We now
describe how to implement this recursion using only logarithmic auxiliary space.
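
The recursion given by Cases 1-7 can be written down directly. The following Python sketch is illustrative only: the tuple encoding of expressions (`('+', p, q)`, `('*', p, q)`, attributes as strings) and the name `leq_id` are assumptions made here, and the sketch relies on the ordinary call stack, so it shows the recursion itself rather than the logarithmic-space implementation discussed next.

```python
def leq_id(e, f):
    """Recursive test of e <=_id f, following Cases 1-7."""
    if isinstance(e, tuple) and e[0] == '+':          # Case 7
        return leq_id(e[1], f) and leq_id(e[2], f)
    if isinstance(f, tuple) and f[0] == '*':          # Cases 2 and 5
        return leq_id(e, f[1]) and leq_id(e, f[2])
    if isinstance(e, str) and isinstance(f, str):     # Case 1
        return e == f
    if isinstance(e, str):                            # Case 3: f is a sum
        return leq_id(e, f[1]) or leq_id(e, f[2])
    if isinstance(f, str):                            # Case 4: e is a product
        return leq_id(e[1], f) or leq_id(e[2], f)
    # Case 6: e is a product and f is a sum (four recursive calls)
    return (leq_id(e[1], f) or leq_id(e[2], f)
            or leq_id(e, f[1]) or leq_id(e, f[2]))
```

For example, the sketch accepts $(A \cdot B) + (A \cdot C) \le_{id} A \cdot (B + C)$ but rejects the converse inequality.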

First, note that the results of intermediate recursive calls need not be stored. For example,
consider Case 7: if the recursive call for $p \le_{id} e'$ returns false, then we immediately return false;
otherwise, we return the result of the recursive call for $q \le_{id} e'$.

We will also argue that we do not need to store the arguments of previous recursive calls. Thus, all
we need to have in storage at any particular point is the arguments of the recursive call which is being
evaluated. Since these arguments are subexpressions of $e,e'$, we can just have two pointers to the
appropriate places in the input, and this only takes logarithmic space.

We will now describe how, given two pointers to two subexpressions $p,p'$ of $e,e'$ respectively, we
can find the next recursive call to be evaluated, using only logarithmic additional space. We assume
that $e,e'$ are represented (in the standard way) as binary trees, so that, given a pointer to a node $u$, we
can find a pointer to the father (right son, left son) of $u$.

We use two auxiliary pointers $a,a'$, initialized to the root of $e,e'$ respectively. Let $C(e,e')$ be the set of
recursive calls generated from the call $e \le_{id} e'$ ($C(e,e')$ contains either two or four members,
depending on which of Cases 2-7 is the relevant one). We will show that we can determine which
member of $C(e,e')$ eventually gives rise to the call $p \le_{id} p'$, using only logarithmic additional space. If
this member of $C(e,e')$ turns out to be the call $e_1 \le_{id} e_1'$, we set the pointers $a,a'$ to the expressions
$e_1,e_1'$ respectively and we repeat with $C(e_1,e_1')$. Continuing in this way, we will eventually find $e_i,e_i'$
such that the call $p \le_{id} p'$ is in $C(e_i,e_i')$. We can then easily determine the next call to be evaluated.

Finally, note that, to determine which member of $C(e,e')$ eventually gives rise to the call $p \le_{id} p'$,
we only need to know whether $p$ ($p'$) is in the left or in the right subtree of $e$ ($e'$). This can be found
by walking the tree representing $e$ in a depth-first fashion, until we encounter $p$. This walk can be
done using only logarithmic additional space, because all we need to remember is the node $v$ which is
currently visited and the node $w$ which was visited immediately before $v$: if $w$ is the father of $v$, we
next visit the left son of $v$; if $w$ is the left son of $v$, we next visit the right son of $v$; if $w$ is the right son
of $v$, we next visit the father of $v$.
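
The depth-first walk that remembers only the current node and the previously visited node can be sketched as follows. This is a minimal Python illustration under assumptions of our own: the class `Node` and the functions `walk` and `subtree_side` are hypothetical names, every internal node is assumed to have exactly two sons, and a leaf (which has no left son) is handled by returning directly to its father.

```python
class Node:
    """Binary expression-tree node with a pointer to its father."""

    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right, self.parent = label, left, right, None
        for child in (left, right):
            if child is not None:
                child.parent = self


def walk(root):
    """Yield the nodes below root in depth-first order, keeping only two pointers."""
    stop = root.parent              # leaving root upwards ends the walk
    prev, cur = stop, root
    while cur is not stop:
        if prev is cur.parent:      # arrived from the father: first visit
            yield cur
            nxt = cur.left if cur.left is not None else cur.parent
        elif prev is cur.left:      # returned from the left son
            nxt = cur.right
        else:                       # returned from the right son
            nxt = cur.parent
        prev, cur = cur, nxt


def subtree_side(e, p):
    """Return 'left' or 'right' for a node p lying strictly below e."""
    return 'left' if any(v is p for v in walk(e.left)) else 'right'
```

Only `prev`, `cur` and `stop` are stored during the walk, each of which is a pointer into the input and therefore takes logarithmic space.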



5.4 Testing Satisfaction 

Given a database $d$ over $\mathcal{U}$ and a set of PD's $E$, we want to test if $d$ is consistent with $E$, i.e. if there
is a weak instance $w$ for $d$ satisfying $E$. Recall that a relation $w$ over $\mathcal{U}$ is a weak instance for $d$ iff
every tuple of relation $R[U]$ of $d$ appears in the projection of $w$ on $U$. Weak instances have been
proposed as a way to model incomplete information in databases [38, 64]. Given a database $d$ and a
set of FD's $E$, we can test if $d$ has a weak instance satisfying $E$ in polynomial time [38]. We now show
how this test can be generalized to arbitrary PD's.
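
The definition of a weak instance (ignoring the dependencies) amounts to a containment test on projections. The Python sketch below is illustrative only; the representation of $w$ as a list of dictionaries and of $d$ as a mapping from relation schemes to sets of tuples, as well as the names `project` and `is_weak_instance`, are assumptions made for this example.

```python
def project(relation, attrs):
    """Project a relation (a list of dicts) onto the attribute tuple attrs."""
    return {tuple(t[a] for a in attrs) for t in relation}


def is_weak_instance(w, d):
    """Check that every tuple of every relation R[U] of d occurs in the projection of w on U.

    d maps each scheme U (a tuple of attribute names) to its set of tuples.
    """
    return all(tuples <= project(w, U) for U, tuples in d.items())
```

This checks only the containment condition; whether $w$ also satisfies the given dependencies is a separate test.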

First, we replace $E$ by a set $E'$ of PD's of the form $C = A \cdot B$ or $C = A + B$, where $A,B,C$ are
attributes from a universe $\mathcal{U}'$ containing $\mathcal{U}$: this is done by (recursively) replacing $X = Y \cdot Z$ by the
PD's $X = C$, $Y = A$, $Z = B$, $C = A \cdot B$, where $A,B,C$ are new attribute names. It is easy to check that
there is a weak instance for $d$ satisfying $E$ iff there is a weak instance for $d$ satisfying $E'$.

Let us denote by $p \to q$, where $p,q$ are partition expressions, the PD $p = p \cdot q$. This slight abuse of
notation is consistent, since the FPD $X \to Y$ is actually equivalent to the FD $X \to Y$. Now a PD
$C = A \cdot B$ in $E'$ can be replaced by the FPD's $C \to AB$, $AB \to C$, and a PD $C = A + B$ in $E'$ can be
replaced by the PD's $A + B \to C$, $C \to A + B$. Furthermore, the PD $A + B \to C$ can be replaced by the
FPD's $A \to C$, $B \to C$. We now have a set $F$ consisting of FPD's and of PD's of the form $C \to A + B$, and
it is obvious that there is a weak instance for $d$ satisfying $E'$ iff there is a weak instance for $d$ satisfying
$F$.
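
The passage from $E'$ to $F$ is a purely syntactic translation, which can be sketched as follows. The encoding is an assumption made for illustration: a PD $C = A \cdot B$ or $C = A + B$ of $E'$ is given as a 4-tuple `(C, op, A, B)`, an FPD $X \to Y$ as a pair of attribute tuples, and a PD $C \to A + B$ as a triple `(C, A, B)`; the name `to_fpd_form` is hypothetical.

```python
def to_fpd_form(e_prime):
    """Translate E' into F, returned as (list of FPD's, list of PD's C -> A + B)."""
    fpds, sum_pds = [], []
    for c, op, a, b in e_prime:
        if op == '*':
            # C = A.B  becomes  C -> AB  and  AB -> C
            fpds += [((c,), (a, b)), ((a, b), (c,))]
        else:
            # C = A + B  becomes  A + B -> C  (split into A -> C, B -> C)  and  C -> A + B
            fpds += [((a,), (c,)), ((b,), (c,))]
            sum_pds.append((c, a, b))
    return fpds, sum_pds
```

Only the PD's of the form $C \to A + B$ remain outside the FD framework after this step.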

Now compute (using the algorithm of the previous Section) all consequences of $F$ of the form
$A \to B$, $A,B$ in $\mathcal{U}'$, and add them to $F$. Furthermore, if now $F$ contains $A \to B$ and $C \to A + B$, replace
$C \to A + B$ by $C \to B$. Let $F'$ be the set of FPD's in $F$. The crucial fact is given in the following

Lemma 5.3: There is a weak instance for $d$ satisfying $F$ iff there is a weak instance for $d$ satisfying
$F'$.

Proof: The "only if" direction is obvious. For the converse, let $w$ be a weak instance for $d$
satisfying $F'$. Suppose some PD $C \to A + B$ in $F$ is violated by tuples $t_1, t_2$ of $w$, where $t_1[ABC] = a_1 b_1 c$,
$t_2[ABC] = a_2 b_2 c$, $a_1 \ne a_2$, $b_1 \ne b_2$. We can remedy this violation by adding to $w$ a tuple $s$ such that
$s[AB] = a_1 b_2$. To make sure that the relation $w_1$ obtained still satisfies $F'$, let $A^+ = \{X \mid F' \models A \to X\}$,
$B^+ = \{X \mid F' \models B \to X\}$: we make $s[A^+] = t_1[A^+]$, $s[B^+] = t_2[B^+]$, and fill in the rest of the attributes of
$s$ with distinct new values (not appearing in $w$). To argue that this is indeed possible, observe first that
$B$ is not in $A^+$ and $A$ is not in $B^+$ (otherwise $C \to A + B$ would not appear in $F$). We also have to
make sure that, if $Q \in A^+$ and $Q \in B^+$, then $t_1[Q] = t_2[Q]$. But if $Q$ appears in both $A^+$ and $B^+$ we
have $F' \models A \to Q$, $F' \models B \to Q$, so since $C \to A + B$ is in $F$ we have $F \models C \to Q$, and therefore $C \to Q$ is in
$F'$. This implies that $t_1[Q] = t_2[Q]$, since $t_1[C] = t_2[C]$ and $w$ satisfies $F'$.

We now repeat the above argument, starting with $w_1$, to obtain relations $w_2$, $w_3$ and so on. The
relation $w_\omega$ obtained after an infinite number of steps is a weak instance for $d$ satisfying $F$, because
any violation of some PD $C \to A + B$ appearing at any stage has been taken care of at some later stage.
∎
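
A single repair step of the above construction, i.e. building the tuple $s$ from $t_1$, $t_2$, $A^+$ and $B^+$, can be sketched in Python. The sketch is illustrative only: tuples are represented as dictionaries, `fresh` is any iterator producing values not appearing in $w$, and the name `repair_tuple` is hypothetical. The closures $A^+$ and $B^+$ are assumed to have been computed beforehand.

```python
from itertools import count


def repair_tuple(t1, t2, a_plus, b_plus, attributes, fresh):
    """Build the tuple s with s[A+] = t1[A+], s[B+] = t2[B+], fresh values elsewhere."""
    s = {}
    for x in attributes:
        if x in a_plus and x in b_plus:
            # The proof shows t1 and t2 must agree on such attributes.
            assert t1[x] == t2[x]
        if x in a_plus:
            s[x] = t1[x]
        elif x in b_plus:
            s[x] = t2[x]
        else:
            s[x] = next(fresh)      # distinct new value
    return s
```

Since $A \in A^+$ and $B \in B^+$, the tuple produced has $s[AB] = a_1 b_2$ as required. The sketch performs one step only; the limit relation $w_\omega$ of the proof is in general not obtained after finitely many such steps.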

We can now prove the main result:

Theorem 5.4: There is a polynomial-time algorithm to test whether a given database $d$ is consistent
with a set $E$ of PD's.

Proof: Using the polynomial-time algorithm for inference of PD's given in Section 5.3, we can
construct the set $F'$. By Lemma 5.3, we can then use the algorithm of [38] to test if $d$ is consistent with
$F'$. ∎

Observe that the weak instance constructed in the Proof of Lemma 5.3 is in general infinite. The
problem of testing existence of a finite weak instance is open.

Figure 5-2: MVD's are not expressible by PD's

Chapter Six
Directions for Further Investigation



Extending the Equational Approach 

Of course, the most obvious question is whether our equational formulation of FD's and IND's
can be extended to more general dependencies. We outline some partial results we have at this point,
which indicate that such an extension is indeed possible.

Recall that an embedded implicational dependency (EID) is a typed sentence of the form
$\forall x_1 \ldots x_p.\ [(\varphi_1 \wedge \ldots \wedge \varphi_n) \Rightarrow \exists y_1 \ldots y_q.\ (\psi_1 \wedge \ldots \wedge \psi_m)]$,
where each $\varphi_k$ is a relational formula, each $\psi_k$ is either a relational formula or an equality between
two of the $x_k$'s, and each of the $x_k$'s appears in one of the $\varphi_k$'s (cf. Section 5.1). If all the $\psi_k$'s are
relational formulas, we have a tuple generating dependency (TGD); if all the $\psi_k$'s are equalities, we
have an equality generating dependency (EGD) [10, 11, 34].

Every EID is obviously equivalent to the conjunction of a TGD and an EGD. Furthermore, it can
be shown that every EGD is equivalent to a conjunction of FD's and TGD's [11]. The question then
is whether we can have an equational formulation of FD's and TGD's.

Let $\mathcal{U} = \{A,B,C\}$ and consider the MVD $A \twoheadrightarrow B$ (cf. Example 5.1). We can formulate it as the
sentence

$\forall x_1 x_2.\ [a(x_1) = a(x_2) \Rightarrow \exists y.\ (a(y) = a(x_1) \wedge b(y) = b(x_1) \wedge c(y) = c(x_2))]$.
Here $x_1,x_2,y$ are variables ranging over tuples; see Section 1.3. Now Skolemization suggests
transforming this MVD into an equational implication

$a x_1 = a x_2 \Rightarrow (a\,i\,x_1 x_2 = a x_1 \wedge b\,i\,x_1 x_2 = b x_1 \wedge c\,i\,x_1 x_2 = c x_2)$.
In this way, we can transform any TGD into an equational implication. In fact, we can even relax the
typedness restriction, to obtain a class of constraints which properly includes IND's: specifically, it
suffices if only the part of the sentence consisting of the $\varphi_k$'s is typed.
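
The first-order sentence for the MVD $A \twoheadrightarrow B$ over $\mathcal{U} = \{A,B,C\}$ can be checked on a finite relation by direct enumeration. The Python sketch below is illustrative only; the representation of the relation as a list of dictionaries and the name `satisfies_mvd` are assumptions made for this example. The witness tuple $y$ found for each pair $x_1, x_2$ plays the role of the Skolem term $i\,x_1 x_2$.

```python
def satisfies_mvd(relation):
    """Check the MVD A ->> B on a relation over {A, B, C}, given as a list of dicts."""
    rows = {(t['A'], t['B'], t['C']) for t in relation}
    # For all x1, x2 agreeing on A, the tuple y with y[AB] = x1[AB], y[C] = x2[C] must exist.
    return all((a1, b1, c2) in rows
               for (a1, b1, _) in rows
               for (a2, _, c2) in rows
               if a1 == a2)
```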

We can go even further and transform these equational implications into equations. We illustrate
how this is done with the implication

$a x_1 = a x_2 \Rightarrow a\,i\,x_1 x_2 = a x_1$.
This can be transformed into the set of equations

$a\,i\,x_1 x_2 = f_a\, x_1\, x_2\, a x_1\, a x_2$
$f_a\, x_1\, x_2\, x\, x = a x_1$,

where $f_a$ is a new function symbol of ARITY 4.
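
One direction of this transformation can be verified directly (a short check added here for illustration): the two equations imply the original implication. Assuming $a x_1 = a x_2$, we have

\[ a\,i\,x_1 x_2 \;=\; f_a\, x_1\, x_2\, (a x_1)\, (a x_2) \;=\; f_a\, x_1\, x_2\, (a x_1)\, (a x_1) \;=\; a x_1, \]

where the first equality is the first equation, the second replaces $a x_2$ by $a x_1$ using the hypothesis, and the third is the second equation with $x$ instantiated to $a x_1$.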

The above equational formulation of TGD's can be used to prove a generalization of Theorem
2.1, for implication of TGD's from FD's and TGD's (i.e., we actually generalize the IND Case of
Theorem 2.1). The proof uses the same ideas as the proof of Theorem 2.1. Unfortunately, the proof of
the FD Case does not generalize, because the inductive argument for the completeness part depends
critically on the fact that Skolem functions have only one argument (which only happens in the case
of IND's).

Designing Normal Form Schemas 

An active area of research in logical database design is concerned with canonical representations
of the database schema, which avoid potential update anomalies (i.e. updates that can result in
inconsistent data), and minimize data redundancy. Several such representations have been proposed
and analyzed, assuming that the only integrity constraints of the database schema are FD's. The
general idea is that the database schema should be in a certain normal form [22, 7, 62, 51], i.e. certain
restrictive conditions should be satisfied by the FD's of the schema and their logical consequences.
Given a universe $\mathcal{U}$ of attributes and a finite set $\Sigma$ of FD's, one can construct a database schema
satisfying such restrictions [12, 6]. These algorithms are based on efficient solutions of the implication
problem.

An interesting question is to investigate normal forms in the presence of FD's and IND's (cf. [33]).
Eventually one would hope to extend the known schema synthesis algorithms to incorporate IND's of
some restricted form (for example, unary IND's). The insights we have gained on the implication
problem can potentially be useful for this investigation.

Query Equivalence in the Presence of IND's

The problem of optimizing queries has received a lot of attention, because of its central role in all 
relational database implementations [62]. Given a query Q, the goal is to design an equivalent query
Q' which can be processed as efficiently as possible (i.e. contains a minimum number of instances of
expensive operators, such as join). Since equivalence of two queries is a data dependency, the 
problem of testing equivalence of queries in the presence of dependencies can be approached with 
the standard tools for implication problems [3, 18, 62]. 

The equivalence of relational database queries in the presence of FD's and IND's has been 
examined in [43, 48], essentially by extending classical techniques (namely the chase). The authors of 
[43] show that under reasonable restrictions on the IND's, query equivalence can be reduced to well- 
understood cases involving only FD's. The approach of [48] is to introduce the weak instance 
assumption [38, 64]; under this restriction, query equivalence in the presence of FD's and typed 
IND's can be handled by the methods of [43]. 

Many questions remain unanswered in the area and new techniques seem to be required to handle 
major new cases. The techniques we have developed for FD and IND implication may be useful in 
this respect. In particular, it would be interesting to see if the tools we provide for typed IND's can 
be used to study equivalence of (typed) conjunctive queries [18, 43] in the presence of typed IND's 
and FD's, without the weak instance assumption of [48]. 

Expressing Data Distribution 

An important consideration in the context of distributed databases is to find ways to preprocess 
relations stored at different sites, so that a given query can be processed with a minimum amount of 
data communication between sites. Some work has already been done on characterizing database 
schemes and queries for which such preprocessing is possible [8, 13]. An interesting research direction 
is to extend these results to allow for the presence of FD's (conceivably we will be able to preprocess 
more queries if the database is constrained to satisfy a set of FD's). Since data distribution can be 
modeled by IND's, these questions can be approached as implication problems involving FD's and 
IND's. 

Performance of Equational Theorem Provers 

An interesting practical question is how well theorem provers designed around the Knuth-Bendix
method [46] perform on sets of equations obtained from database constraints. We have experimented
with the REVE system [35, 49], which has been able to handle various non-trivial inferences of FD's
and IND's. However, more work needs to be done in this direction.